An NVIDIA AI breakthrough will allow developers and artists to create new interactive 3D virtual worlds by training models on videos from the real world for the first time, a development that could prove significant for computer vision, robotics and graphics.
It is the first time neural networks have been used with a computer graphics engine to render new, fully synthetic worlds, say NVIDIA researchers, who demonstrated it via a driving simulator powered by a single high-end NVIDIA GPU this week.
They also used the same techniques as the driving simulator to create a relatively convincing avatar of one of the paper’s co-authors and, by training the model on a video of a woman dancing to the Korean pop song “Gangnam Style”, synthesised her moves.
The company’s Ting-Chun Wang said the capability allows developers to “rapidly create interactive graphics at a much lower cost than traditional virtual modeling.”
In a research paper [pdf] detailing the innovation, the company’s researchers wrote: “Learning to synthesize continuous visual experiences has a wide range of applications in computer vision, robotics, and computer graphics…”
They added: “Using a learned video synthesis model, one can generate realistic videos without explicitly specifying scene geometry, materials, lighting, and dynamics.”
The tool was presented at the NeurIPS conference in Montreal, Canada this week. NeurIPS (previously known as NIPS) is one of the top annual gatherings for people working on the cutting edge of AI and machine learning.
Called vid2vid, the AI model behind the demo uses a deep learning technique known as generative adversarial networks (GANs) to render photorealistic videos from high-level representations such as semantic layouts, edge maps and poses, NVIDIA said in a blog post.
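For readers unfamiliar with the technique, the sketch below illustrates the general idea of GAN-based synthesis conditioned on a semantic layout: a generator paints an RGB frame from a label map, while a discriminator judges whether the (layout, frame) pair looks real. It is a minimal, illustrative sketch only; the architecture, layer sizes and class count are assumptions and do not reflect NVIDIA’s actual vid2vid model.

```python
# Minimal sketch of conditional GAN-style synthesis from a semantic layout.
# Illustration of the general technique only, not NVIDIA's vid2vid model;
# the architecture, channel counts and class count are assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 35  # assumed number of semantic labels (e.g. road, building, tree)

class Generator(nn.Module):
    """Maps a one-hot semantic layout (B, NUM_CLASSES, H, W) to an RGB frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(NUM_CLASSES, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
            nn.Tanh(),  # RGB output in [-1, 1]
        )

    def forward(self, layout):
        return self.net(layout)

class Discriminator(nn.Module):
    """Judges whether a (layout, frame) pair looks real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(NUM_CLASSES + 3, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, layout, frame):
        return self.net(torch.cat([layout, frame], dim=1))

# One adversarial step (illustrative only): the generator tries to make the
# discriminator accept its synthesised frame for the given layout.
layout = torch.randn(1, NUM_CLASSES, 128, 256)   # stand-in for a real semantic map
G, D = Generator(), Discriminator()
fake_frame = G(layout)
d_fake = D(layout, fake_frame)
g_loss = nn.functional.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```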
Bryan Catanzaro, vice president of Applied Deep Learning Research at NVIDIA, who led the team developing this work, said in the same post: “Neural networks — specifically generative models — will change how graphics are created. This will enable developers to create new scenes at a fraction of the traditional cost.”
As the deep learning network trains, it becomes better at producing videos that are smooth and visually coherent, with minimal flickering between frames. The researchers’ model can synthesise 30-second street-scene videos at 2K resolution.
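To make the notion of frame-to-frame coherence concrete, one simple way to encourage it is a loss term that penalises large pixel changes between consecutive generated frames. The snippet below is a rough, hypothetical illustration of that idea only; it is not the objective used in the vid2vid paper, which relies on more sophisticated machinery such as flow-based warping and video discriminators.

```python
# Rough, hypothetical illustration of a temporal-smoothness penalty: discourage
# large pixel-wise change between consecutive generated frames. A simplification
# for explanation only, not the loss used in the vid2vid paper.
import torch

def temporal_smoothness_loss(frames: torch.Tensor) -> torch.Tensor:
    """frames: a (T, C, H, W) tensor holding T consecutive generated frames."""
    diffs = frames[1:] - frames[:-1]   # change between frame t and frame t+1
    return diffs.abs().mean()          # L1 penalty on frame-to-frame change

clip = torch.rand(30, 3, 256, 512)     # stand-in for a generated 30-frame clip
loss = temporal_smoothness_loss(clip)  # would be added to the generator's objective
```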
See also: The Deepfake Threat
Trained on different video sequences, the model can paint scenes that look like different cities around the world.
The researchers added: “Our method also grants users flexible high-level control over the video generation results. For example, a user can easily replace all the buildings with trees in a street view video. In addition, our method works for other input video formats such as face sketches and body poses, enabling many applications from face swapping to human motion transfer.”
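The “buildings to trees” example boils down to editing the semantic label map before it is handed to the generator. The snippet below sketches that idea; the class IDs and map size are hypothetical and are not taken from NVIDIA’s released code.

```python
# Minimal sketch of the kind of high-level edit described above: swapping one
# semantic class for another in the label map before synthesis. The label IDs
# (BUILDING, TREE) are hypothetical, not taken from NVIDIA's released code.
import numpy as np

BUILDING, TREE = 11, 21  # assumed class IDs in the semantic label map

def replace_class(label_map: np.ndarray, src: int, dst: int) -> np.ndarray:
    """Return a copy of the label map with every `src` pixel relabelled as `dst`."""
    edited = label_map.copy()
    edited[edited == src] = dst
    return edited

# The label map would normally come from a segmentation of the input video frame;
# here it is a random stand-in. The edited map is then passed to the generator,
# which renders trees where buildings used to be.
label_map = np.random.randint(0, 35, size=(512, 1024), dtype=np.uint8)
edited_map = replace_class(label_map, BUILDING, TREE)
```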
The company has made the code available on GitHub.