Self-Augmented Learning of Differentiable Object Models for Compositional Interpretation of Complex Scenes

Published: 23 Sept 2025, Last Modified: 19 Nov 2025 | SpaVLE Poster | CC BY 4.0
Keywords: scene understanding, object-centric learning, computer vision, physics-based ML, representation learning, differentiable rendering
TL;DR: We present DVP+, an autoencoder for interpreting 2D scenes, which decomposes them into multiple objects, employs a differentiable renderer for reconstruction, and leverages novel self-augmented training strategies.
Abstract: This study builds on the architecture of the Disentangler of Visual Priors (DVP), a type of autoencoder that learns to interpret scenes by decomposing the perceived objects into independent visual aspects of shape, size, orientation, and color appearance. These aspects are expressed as latent parameters that control a differentiable renderer, which performs image reconstruction, so that the model can be trained end-to-end with gradient-based optimization of a reconstruction loss. In this study, we extend the original DVP so that it can handle multiple objects in a scene, and we propose effective training strategies. To address the challenges of optimizing in the presence of a differentiable renderer, we exploit the interpretability of the latent representation by using the decoder to generate self-augmented training examples and by devising alternative training modes that rely on loss functions defined not only in the image space, but also in the latent space. This significantly facilitates training, which is otherwise challenging due to extensive plateaus in the image-space reconstruction loss. We compare our approach with two baselines (MONet and LIVE) on a new benchmark that subsumes the previously proposed Multi-dSprites, and we demonstrate its superiority in terms of reconstruction quality and capacity to decompose overlapping objects. We also analyze the gradients induced by the considered loss functions, explain how they impact the efficacy of training, and discuss the limitations of differentiable rendering in autoencoders and the ways in which they can be addressed.
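The plateau problem and the role of a latent-space loss mentioned in the abstract can be illustrated with a toy sketch. This is not the DVP+ implementation: the 1D soft-box renderer, the two-parameter latent `(center, half_width)`, and the names `render`, `image_loss`, `latent_loss`, and `self_augmented_pair` are all hypothetical, chosen only to show why an image-space loss can be flat when the predicted and target objects do not overlap, while a loss on known latents (available for decoder-generated examples) is not.

```python
import math
import random

WIDTH = 32   # toy 1D "image" width (assumption, not from the paper)
TAU = 0.25   # edge softness of the toy renderer (assumption)

def render(center, half_width):
    """Toy differentiable renderer: a soft 1D box controlled by two latents."""
    return [1.0 / (1.0 + math.exp(-(half_width - abs(i - center)) / TAU))
            for i in range(WIDTH)]

def image_loss(pred_latent, target_image):
    """Squared error in image space between a rendered guess and the target."""
    pred = render(*pred_latent)
    return sum((p - t) ** 2 for p, t in zip(pred, target_image))

def latent_loss(pred_latent, true_latent):
    """Squared error in latent space; needs the true latent to be known."""
    return sum((p - t) ** 2 for p, t in zip(pred_latent, true_latent))

def self_augmented_pair(rng):
    """Sample a latent and render it, so the image comes with a known latent."""
    latent = (rng.uniform(4, WIDTH - 4), rng.uniform(1, 3))
    return latent, render(*latent)

# Target object sits at position 25; both guesses are far away from it.
true_latent = (25.0, 2.0)
target = render(*true_latent)
far, nearer = (5.0, 2.0), (8.0, 2.0)

# Moving the guess from 5 to 8 barely changes the image loss (a plateau),
# but it clearly reduces the latent loss.
plateau_gap = abs(image_loss(far, target) - image_loss(nearer, target))
latent_gap = latent_loss(far, true_latent) - latent_loss(nearer, true_latent)
print(plateau_gap, latent_gap)
```

In this toy setting the image-loss difference between the two guesses is negligible, because neither rendered box overlaps the target, whereas the latent loss drops by 111. The latent loss is only usable because `self_augmented_pair` produces images whose generating latents are known, which is the general idea the abstract attributes to self-augmented training.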
Submission Type: Long Research Paper (< 9 Pages)
Submission Number: 11