A unified theory of scene representation learning and object representation learning

16 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Representation Learning, Multi-Object Representation Learning, Algebraic Independence
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We formulate two problems, (1) the decomposition of a multi-object scene into single objects and (2) the decomposition of a single object into its attributes, within a common framework: algebraic independence.
Abstract: The goal of representation learning is the unsupervised learning of simple and useful representations that model sensory input. Many methods have been proposed, but a unified theory has not yet been established. Representation learning for a visual scene that contains multiple objects involves two problems: scene representation learning and object representation learning. Scene representation refers to decomposing a single visual scene containing multiple objects into a combination of individual objects. Object representation refers to decomposing a single object into a combination of attributes, such as position and shape. Previous studies have formulated scene representation learning and object representation learning in different ways. Recently, Ohmura et al. (2023) proposed a theory of object representation learning in which transformations between two objects are learned to satisfy algebraic independence, so that one attribute of an object can be transformed while the others remain invariant. Existing methods of object representation learning often impose independence between scalar variables, whereas the theory based on algebraic independence weakens this constraint from scalar variables to latent vectors. Latent vectors are also used to represent individual objects in existing methods of scene representation learning, because a vector can carry more information than a scalar. Furthermore, one of the main components of algebraic independence is commutativity, and existing methods of scene representation learning typically represent a visual scene as the sum of multiple object representations, a composition that is commutative. We focus on these commonalities between object representation learning and scene representation learning: constraints between latent vectors and commutativity. We propose a unified theory based on algebraic independence that explains both scene representation learning and object representation learning, and we validate the theory in experiments on an image dataset containing multiple objects.
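The commutativity point in the abstract can be illustrated with a minimal sketch. The decoder, latent dimension, and composition-by-summation below are illustrative assumptions, not the authors' model: if a scene is rendered as the sum of decoded per-object latent vectors, the order in which objects are composed cannot matter.

```python
import numpy as np

# Hypothetical sketch: a scene is the sum of decoded per-object latent vectors,
# so composing objects is commutative. The decoder is a stand-in linear map.
rng = np.random.default_rng(0)
latent_dim, image_shape = 4, (8, 8)
W = rng.normal(size=(latent_dim, image_shape[0] * image_shape[1]))

def decode_object(z):
    """Map one object's latent vector to its contribution to the scene image."""
    return (z @ W).reshape(image_shape)

def compose_scene(object_latents):
    """Scene = sum of per-object contributions (a commutative composition)."""
    return sum(decode_object(z) for z in object_latents)

z1, z2 = rng.normal(size=latent_dim), rng.normal(size=latent_dim)
assert np.allclose(compose_scene([z1, z2]), compose_scene([z2, z1]))  # order-invariant
```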
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 551