Researchers make it easier to visualize 3D scenes from photos
By Andrew Clark
A new approach is making it easier to visualize lifelike 3D environments from everyday photos already shared online, opening new possibilities in industries such as gaming, virtual tourism and cultural preservation.
Hadar Averbuch-Elor, assistant professor at Cornell Tech, is part of the research team behind “WildCAT3D,” a new framework that significantly expands the possibilities of novel view synthesis (NVS), a technique that generates realistic new views of a scene from just a single existing photo.
The work, which was presented Dec. 4 at the Conference on Neural Information Processing Systems (NeurIPS), focuses on a key limitation in current 3D image-generation technology: Most systems can only learn from small, carefully curated datasets that look nothing like the messy, inconsistent images people actually take and share online.
WildCAT3D shows how computers can be trained using large collections of freely available images – tourist snapshots; photos taken in different weather, lighting and seasons; or partially obscured scenes. These are exactly the kinds of images that could power applications such as virtual tourism, video games, historical preservation and immersive mapping, but they have traditionally been too inconsistent for use in existing models.
“The main challenge was how to design a multi-view diffusion model that can learn from in-the-wild internet collections, where scene observations exhibit significant variations – for example, in illumination, weather, transient objects and so on,” said Averbuch-Elor, who is also affiliated with the Cornell Ann S. Bowers College of Computing and Information Science.
WildCAT3D helps artificial intelligence focus on what matters in a scene. Rather than becoming confused by changes in lighting, weather or camera angle, the system learns to recognize the stable structure of a place while treating such visual differences as transient details.
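To make that separation concrete: one common way to let a model attribute lighting, weather and other transient effects to something other than the scene itself is to give each training photo its own learned appearance code, while the shared network weights are forced to capture the stable structure. The sketch below illustrates that general idea in PyTorch; the class, dimensions and simplified training step are illustrative assumptions for this article, not WildCAT3D’s actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppearanceConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts noise in a target view, conditioned on
    (a) features of the reference photo (the stable scene content) and
    (b) a per-image appearance code that absorbs transient effects
    such as illumination, weather and passersby.
    (Hypothetical sketch; not the published WildCAT3D model.)"""

    def __init__(self, image_dim=256, appearance_dim=32):
        super().__init__()
        self.appearance_proj = nn.Linear(appearance_dim, image_dim)
        self.net = nn.Sequential(
            nn.Linear(image_dim * 2, 512),
            nn.ReLU(),
            nn.Linear(512, image_dim),
        )

    def forward(self, noisy_view, reference_feats, appearance_code):
        # The appearance code "explains away" lighting/weather variation,
        # so the shared weights are pushed to model scene structure.
        cond = reference_feats + self.appearance_proj(appearance_code)
        return self.net(torch.cat([noisy_view, cond], dim=-1))

# One simplified training step on a batch of in-the-wild photos.
batch, image_dim, appearance_dim, num_photos = 8, 256, 32, 1000
model = AppearanceConditionedDenoiser(image_dim, appearance_dim)
appearance_table = nn.Embedding(num_photos, appearance_dim)  # one code per photo
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(appearance_table.parameters()), lr=1e-4
)

photo_ids = torch.randint(0, num_photos, (batch,))      # which photos we sampled
reference_feats = torch.randn(batch, image_dim)         # stand-in image features
clean_view = torch.randn(batch, image_dim)              # stand-in target view
noise = torch.randn_like(clean_view)
noisy_view = clean_view + noise  # diffusion noise schedule omitted for brevity

pred = model(noisy_view, reference_feats, appearance_table(photo_ids))
loss = F.mse_loss(pred, noise)   # standard denoising objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the appearance codes are cheap per-photo parameters, inconsistent internet photos of the same landmark can all serve as training signal rather than being discarded as noise.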
This approach makes it far more useful in real-world settings. WildCAT3D can take a single photo and generate multiple, realistic views of the same place, making it possible to “walk around” a scene that was only photographed once. This capability opens the door to richer virtual-tourism experiences, more immersive video games and more accurate digital reconstructions of real-world locations.
It also allows creators and researchers to easily explore how a scene might appear under different weather and lighting conditions. That flexibility is especially valuable for preserving cultural landmarks, planning environments before they are built or restored, and creating realistic virtual spaces without the need for expensive, carefully controlled photo shoots.
Averbuch-Elor sees this work as a step toward making high-quality 3D scene creation more accessible, allowing anyone with ordinary photos – and not just specialized teams with custom datasets – to build realistic digital worlds.
“We hope that our work catalyzes a shift toward 3D-consistent generative frameworks that learn from permissively licensed internet data directly, reducing the field’s reliance on heavily curated multi-view datasets,” she said.
This research was carried out in collaboration with researchers from Meta AI and Tel Aviv University.
Andrew Clark is a writer for Cornell Tech.