An Object-Attribute Decoupled Approach for Learning Disentangled Representation for Image and Video Analysis
Keywords: object-centric, disentangled object representation
Abstract: Learning disentangled representations for images and videos in terms of objects and their attributes without explicit supervision is an important but challenging task. Recent work~\cite{nsb} extends slot-based techniques for object discovery by decomposing slots into blocks, where each block is expressed as a linear combination of a fixed number of learnable concepts. At its core, this approach couples object and attribute discovery and assumes that image encoders innately learn disentangled features, an assumption that our experiments show does not always hold.
We propose DeCoupler, a method that separates object discovery from attribute discovery by first using foundation models to extract object masks and then learning block representations that capture attributes across objects. This leads to improved disentanglement, enabling tasks such as attribute-level interventions and dynamics prediction. We demonstrate these capabilities through experiments on five image and two video datasets, showing better disentanglement and generalization than prior methods.
Submission Number: 37