Researchers at Nvidia say they have created a rendering framework that can produce 3D objects from 2D images with the correct shape, color, texture and lighting, an approach that could help machine learning models achieve depth perception.
The rendering framework called DIB-R — a differentiable interpolation-based renderer — produces 3D objects from 2D images and was presented this week at the annual conference on Neural Information Processing Systems in Vancouver, Canada.
The framework, when wrapped around a neural network, learns to predict shape, texture, and light from single images and generate 3D shapes from a photo.
In the paper presented this week, the researchers (from Nvidia, the University of Toronto, Vector Institute, McGill University and Aalto University) noted: “Many machine learning models operate on images, but ignore the fact that images are 2D projections formed by 3D geometry interacting with light, in a process called rendering…
“Enabling machine learning models to understand the image formation process could facilitate disentanglement of geometry from the lighting effects, which is key in achieving invariance and robustness.”
DIB-R 2D to 3D Rendering
DIB-R uses an encoder-decoder architecture to transform the input 2D image into a feature map, which is then used to predict properties of the object in the image such as its shape, texture and lighting.
DIB-R starts from a polygon sphere and deforms it until it matches the object shown in the 2D image it is trying to reproduce in 3D. The researchers trained the model on several image datasets, ranging from a collection of bird photos to images of vehicles.
The framework could potentially be used by archaeological researchers to create 3D models of objects that have been discovered and photographed during excavations.
Training the model takes just two days on a single NVIDIA V100 GPU. Once trained, DIB-R can create a 3D object from a 2D image in under 100 milliseconds. DIB-R is built on the machine learning framework PyTorch.
The researchers noted: “Key to our approach is to view foreground rasterization as a weighted interpolation of local properties and background rasterization as a distance-based aggregation of global geometry. Our approach allows for accurate optimization over vertex positions, colors, normals, light directions and texture coordinates through a variety of lighting models.”
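To make the two ideas in that quote concrete, here is a minimal sketch in plain Python. It is not Nvidia's implementation. It assumes a 2D triangle already projected onto the image plane, uses barycentric weights as the "weighted interpolation of local properties" for a pixel covered by a face, and uses an exponential falloff with distance, combined across faces, as one plausible form of the "distance-based aggregation" for pixels outside every face. The function names, the falloff formula and the delta parameter are illustrative assumptions.

```python
import math

def barycentric_weights(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w0 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / denom
    w1 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / denom
    return w0, w1, 1.0 - w0 - w1

def interpolate_foreground(p, tri, vertex_attrs):
    """Foreground pixel: blend per-vertex attributes (e.g. RGB) by barycentric weight."""
    weights = barycentric_weights(p, *tri)
    channels = len(vertex_attrs[0])
    return tuple(
        sum(w * attr[ch] for w, attr in zip(weights, vertex_attrs))
        for ch in range(channels)
    )

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    apx, apy = p[0] - a[0], p[1] - a[1]
    t = (apx * abx + apy * aby) / (abx * abx + aby * aby)
    t = max(0.0, min(1.0, t))  # clamp to the segment
    return math.hypot(apx - t * abx, apy - t * aby)

def point_triangle_distance(p, tri):
    """Zero if p lies inside the triangle, else distance to the nearest edge."""
    if all(w >= 0.0 for w in barycentric_weights(p, *tri)):
        return 0.0
    a, b, c = tri
    return min(point_segment_distance(p, a, b),
               point_segment_distance(p, b, c),
               point_segment_distance(p, c, a))

def soft_alpha(p, triangles, delta):
    """Background pixel: aggregate every face's influence into a soft silhouette value.

    Assumed form: each face contributes exp(-distance / delta), and the
    contributions are combined as 1 - prod(1 - contribution).
    """
    miss = 1.0
    for tri in triangles:
        miss *= 1.0 - math.exp(-point_triangle_distance(p, tri) / delta)
    return 1.0 - miss

tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
inside_color = interpolate_foreground((0.25, 0.25), tri, colors)
near_alpha = soft_alpha((1.1, 0.0), [tri], delta=0.5)
far_alpha = soft_alpha((2.0, 0.0), [tri], delta=0.5)
```

Both functions vary smoothly with the vertex positions, which is the property that lets gradients flow from a rendered image back to the mesh. In this sketch a pixel just outside the triangle still receives a silhouette value close to 1, and that value decays as the pixel moves away.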