TL;DR: We present a method for using generative models for scene understanding tasks with strong out-of-distribution generalization.
Abstract: Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek the conditional parameters of a visual generative model that best fit a given natural image. To enable this procedure to infer scene structure from images substantially different from those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, allowing robust generalization to new test scenes with more objects and novel object shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at https://energy-based-model.github.io/compositional-inference.
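For intuition, here is a minimal, hypothetical sketch of the inverse generative modeling idea described above. It is not the released implementation: the `denoiser` interface, the per-object latents, and the noise schedule are illustrative assumptions. A frozen conditional denoising model is composed by summing per-object predictions, and the object latents are optimized by gradient descent to best explain the observed image.

```python
# Illustrative sketch only (assumed interfaces, not the authors' code):
# infer per-object conditioning latents for a frozen, compositional
# denoising model by minimizing the denoising loss on a single image.
import torch

def add_noise(x0, noise, t, num_steps=1000):
    # Simple linear noise schedule, purely for illustration.
    alpha_bar = 1.0 - t.float() / num_steps
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

def infer_scene_latents(denoiser, image, num_objects=3, latent_dim=64,
                        steps=500, lr=1e-2):
    # One learnable latent per hypothesized object; the scene-level model is
    # assumed to be the composition (sum) of the per-object predictions.
    latents = torch.randn(num_objects, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([latents], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, 1000, (1,))
        noise = torch.randn_like(image)
        noisy = add_noise(image, noise, t)
        # Compose smaller per-object models by summing their noise predictions.
        pred = sum(denoiser(noisy, t, latents[k]) for k in range(num_objects))
        loss = torch.nn.functional.mse_loss(pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latents.detach()  # inferred object-level description of the scene
```

In this sketch, a caller would supply a trained conditional denoiser and an observed image tensor; the optimized latents then serve as the inferred scene description.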
Lay Summary: Recent advances in AI have made it possible for computers to generate incredibly realistic images. But what if we could also use these generative tools to help computers better understand what's in a photo—like identifying all the objects or describing how a scene is structured?
We explore exactly that: we show how the usual process of image generation can be reversed to analyze real-world scenes instead. Think of it like using a recipe (the generative model) to figure out the ingredients in a dish (the photo), even if you've never tasted that exact dish before.
The key idea is to break down complex scenes into smaller parts, using a modular approach—like identifying each object in a room one by one. This makes it easier for the computer to recognize new and unfamiliar scenes by reusing these smaller pieces.
We tested our method in several ways:
Detecting objects in images with more complexity than the system saw during training—for example, images with more items, different shapes, or new backgrounds.
Describing people's faces across genders, even when the model was trained on images of only one gender.
Using powerful image-generation tools to recognize multiple objects in completely new web images, without needing additional training (a concept known as "zero-shot learning").
Overall, our work offers a new way to teach computers to understand the world more like humans do—by learning in flexible, reusable parts—and it shows strong results in both simple and complex settings.
Primary Area: Deep Learning
Keywords: Scene Understanding, Compositional Generative Modeling, Generalization
Submission Number: 7907