DisFormer: Disentangled Object Representations for Learning Visual Dynamics Via Transformers

24 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: unsupervised visual dynamics prediction, object-centric representation, disentangled representation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a novel approach for learning disentangled object representations for the task of learning visual dynamics via transformers.
Abstract: We focus on the task of visual dynamics prediction. Recent work has shown that object-centric representations can greatly improve the accuracy of learning such dynamics in an unsupervised way. Building on this work, we ask: would it help to learn disentangled object representations, possibly separating the attributes that contribute to the motion dynamics from those that do not? Though some prior work aims to achieve this, we argue that it is either limited in its setting or does not use the learned representation explicitly for predicting visual dynamics, making it sub-optimal. In response, we propose DisFormer, an approach for learning disentangled object representations and using them to predict visual dynamics. Our architecture extends the notion of slots (Locatello et al., 2020) to attention over individual object representations: each slot learns the representation of a block by attending over different parts of an object, and each block is expressed as a linear combination over a small set of learned concepts. We perform iterative refinement over these slots to extract a disentangled representation, which is then fed to a transformer architecture to predict the next set of latent object representations. Since our loss is unsupervised, we need to align the output object masks with those extracted from the ground truth image, and we design a novel permutation module to achieve this alignment by learning a canonical ordering. We perform a series of experiments demonstrating that our learned representations help predict future dynamics in the standard setting, where we test on the same environment as training, and in the transfer setting, where certain object combinations are never seen before. Our method outperforms existing baselines in terms of pixel prediction and deciphering the dynamics, especially in the zero-shot transfer setting, where existing approaches fail. Further analysis reveals that our learned representations indeed yield significantly better disentanglement of objects compared to existing techniques.
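The statement that each block is expressed as a linear combination over a small set of learned concepts can be illustrated with a minimal sketch. The abstract does not say how the combination weights are computed; the softmax over scaled dot-product scores used here is an assumption, as are the function name `project_onto_concepts`, the `temperature` parameter, and the toy dimensions. This is not the authors' implementation.

```python
import numpy as np

def project_onto_concepts(block, concepts, temperature=1.0):
    """Express a block vector as a convex combination of concept vectors.

    block:    shape (d,)   -- one block of an object representation
    concepts: shape (k, d) -- a small dictionary of concept vectors
                              (learned in the paper; random here)
    Assumption: weights come from a softmax over scaled dot products.
    """
    d = block.shape[0]
    scores = concepts @ block / (np.sqrt(d) * temperature)
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()       # non-negative, sums to 1
    combination = weights @ concepts        # linear combination, shape (d,)
    return weights, combination

# Toy usage: 4 hypothetical concepts in an 8-dimensional block space.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(4, 8))
block = rng.normal(size=8)
weights, combination = project_onto_concepts(block, concepts)
```

Restricting each block to the span of a few shared concept vectors is one way such a design could encourage blocks to encode separate attributes, though the abstract itself does not spell out the mechanism.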
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9419