Efficient Object-Centric Representation Learning using Masked Generative Modeling

TMLR Paper 4762 Authors

30 Apr 2025 (modified: 30 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: Learning object-centric representations from visual inputs in an unsupervised manner has drawn attention as a route to solving more complex tasks, such as reasoning and reinforcement learning. However, current state-of-the-art methods, which rely on autoregressive transformers or diffusion models to generate scenes from object-centric representations, are computationally inefficient due to their sequential or iterative nature. This computational bottleneck limits their practical application and hinders scaling to more complex downstream tasks. To overcome this, we propose MOGENT, an efficient object-centric learning framework based on masked generative modeling. MOGENT conditions a masked bidirectional transformer on learned object slots and employs a parallel iterative decoding scheme to generate scenes, enabling efficient compositional generation. Experiments show that MOGENT significantly improves computational efficiency, accelerating generation by up to 67x and 17x over autoregressive and diffusion-based models, respectively. Importantly, this efficiency comes with strong or competitive performance on object segmentation and compositional generation tasks.
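To make the decoding scheme in the abstract concrete, below is a minimal sketch of slot-conditioned, MaskGIT-style parallel iterative decoding: a bidirectional transformer over discrete image tokens cross-attends to object slots, and generation starts from a fully masked canvas, committing the most confident predictions at each step. This is not the authors' implementation; all names (`SlotConditionedTransformer`, `iterative_decode`), dimensions, and the cosine mask schedule are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of masked generative
# decoding conditioned on object slots, in the spirit of the abstract.
import math
import torch
import torch.nn as nn

class SlotConditionedTransformer(nn.Module):
    """Bidirectional transformer over discrete image tokens that
    cross-attends to object slots (hypothetical architecture)."""
    def __init__(self, vocab_size=1024, dim=256, num_tokens=256, slot_dim=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size + 1, dim)   # +1 for [MASK]
        self.pos_emb = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.slot_proj = nn.Linear(slot_dim, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab_size)
        self.mask_id = vocab_size

    def forward(self, tokens, slots):
        x = self.tok_emb(tokens) + self.pos_emb       # (B, T, dim)
        x = self.blocks(x, self.slot_proj(slots))     # cross-attend to slots
        return self.head(x)                           # (B, T, vocab_size)

@torch.no_grad()
def iterative_decode(model, slots, num_tokens=256, steps=8):
    """Start from an all-[MASK] canvas; at each step predict every token in
    parallel, commit the most confident ones, and re-mask the rest."""
    B = slots.size(0)
    tokens = torch.full((B, num_tokens), model.mask_id, dtype=torch.long)
    for t in range(steps):
        probs = model(tokens, slots).softmax(-1)
        conf, pred = probs.max(-1)
        # already-committed tokens get confidence 1 so they stay committed
        conf = torch.where(tokens == model.mask_id, conf, torch.ones_like(conf))
        tokens = torch.where(tokens == model.mask_id, pred, tokens)
        # cosine schedule: fraction of tokens left masked after this step
        num_masked = int(math.cos(math.pi / 2 * (t + 1) / steps) * num_tokens)
        if num_masked > 0:
            lowest = conf.topk(num_masked, dim=-1, largest=False).indices
            tokens.scatter_(1, lowest, model.mask_id)
    return tokens  # discrete codes for a (frozen) VQ decoder to render

# Example: decode a batch of 2 scenes from 7 slots each.
# model = SlotConditionedTransformer()
# codes = iterative_decode(model, torch.randn(2, 7, 64))
```

Because every masked position is predicted in one forward pass per step, the number of network evaluations is a small constant (here 8) rather than one per token (autoregressive) or per denoising step (diffusion), which is the source of the speedups the abstract reports.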
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the reviewers' comments, we have made the following modifications to the manuscript.
- We have added experiments on more realistic datasets, CLEVRTex and CelebA, showing that MOGENT can effectively learn object-centric representations on these datasets as well. We have also added a diffusion-based baseline, SlotDiffusion, on these datasets to compare efficiency quantitatively.
- We have added an experiment on unconditional generation, which provides insight into our model's capabilities and highlights directions for future work.
- We have added ablations and comparisons to reinforce our results.
- We have improved the overall writing to clarify our contributions.
Assigned Action Editor: ~Grigorios_Chrysos1
Submission Number: 4762