AI video is fast turning from uncanny valley to genuinely realistic, and Google’s Lumiere is the most sophisticated text-to-video generator we’ve seen to date.
Evoking a sense of awe – and a hefty dose of unease – Google recently demonstrated just how sophisticated AI video has become after only a few years of development.
In the same way that text-to-image generators like Bing Image Creator, DALL-E, and Midjourney can create original images from a single-line prompt, Google’s ‘Lumiere’ application can turn our wildest ideas into fully rendered five-second videos.
Granted, other text-to-video generators are already available, but Google’s attempt is the first to really nail an accurate portrayal of movement, approaching CGI standards.
It achieves this by establishing a base frame and using its highly touted STUNet (Space-Time U-Net) architecture to determine autonomously where and how items in the image should move. Objects within that initial frame are then rendered as several layers of their own that flow into each other seamlessly.
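According to Google's research paper, the key idea behind STUNet is that it downsamples the video in both space *and* time, so the model reasons about the whole clip's motion at once in a compact representation rather than generating keyframes and interpolating between them. The sketch below is purely illustrative (it is not Google's code, and the pooling functions are simplified stand-ins for learned network layers); it only shows what space-time downsampling and upsampling do to a stack of frames:

```python
import numpy as np

# Illustrative sketch only -- not Lumiere's actual implementation.
# A Space-Time U-Net shrinks a video along space AND time, works on the
# compact volume, then expands back, so the clip's full duration is
# processed in one pass.

def space_time_downsample(video, s=2, t=2):
    # video shape: (frames, height, width); average-pool over time and space
    f, h, w = video.shape
    v = video[: f - f % t, : h - h % s, : w - w % s]
    return v.reshape(f // t, t, h // s, s, w // s, s).mean(axis=(1, 3, 5))

def space_time_upsample(video, s=2, t=2):
    # nearest-neighbour expansion back toward the original resolution
    return video.repeat(t, axis=0).repeat(s, axis=1).repeat(s, axis=2)

clip = np.random.rand(80, 128, 128)       # 80 frames, as Lumiere outputs
compact = space_time_downsample(clip)     # (40, 64, 64): cheaper to process
restored = space_time_upsample(compact)   # back to (80, 128, 128)
print(compact.shape, restored.shape)
```

In the real model, learned convolutions and attention replace these pooling operations, but the shape arithmetic is the same: halving the frame count as well as the resolution is what lets the network handle an entire clip's motion jointly.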
Lumiere can generate 80 frames per clip, compared with the previous maximum of 25 achieved by its closest competitor, Stable Video Diffusion. Though several early results released by Google have a touch of artificiality about them, the leap in overall quality since its 2022 demo is staggering.
Beyond text-to-video, there is also image-to-video generation, which brings a still picture to life; stylised generation, which creates videos in a specific visual style; and a cinemagraph setting that animates a specific portion of an existing image – flowing water, a flickering fire, or smoke from a train engine, for instance.
In terms of market strategy, Lumiere’s late arrival fits Google’s fashionably-late approach. Since the early iteration of its generative language tool, Bard, flopped last year, the tech giant has been quietly developing its multimodal vision for generative AI in the background.