Keywords: Object-Centric Representation Learning, Concept Learning
Abstract: We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object segmentation, notably MONet and Slot Attention. Like these algorithms, LORL learns an object-centric representation by reconstructing the input image. But it further learns to associate the learned representations with concepts, i.e., words for object categories, properties, and spatial relationships, drawn from language input. These language-derived, object-centric concepts in turn facilitate the learning of the object-centric representations. LORL can be integrated with various language-agnostic unsupervised segmentation algorithms. Experiments show that, with the help of language, LORL consistently improves the performance of MONet and Slot Attention on two datasets. We also show that concepts learned by LORL aid downstream tasks such as referential expression interpretation.
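The abstract's core recipe, a reconstruction objective inherited from MONet / Slot Attention plus a language-grounding objective over object slots, can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the module names (`slot_encoder`, `decoder`, `concept_embed`), the tensor shapes, and the unweighted sum of the two losses are all assumptions made for readability.

```python
# Illustrative sketch of the LORL training idea described in the abstract.
# All module names and shapes are assumptions, not the paper's actual code.
# It combines (1) a reconstruction loss, as in MONet / Slot Attention, with
# (2) a concept-grounding loss that pulls each object slot toward the
# embedding of the word describing that object.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LORLSketch(nn.Module):
    def __init__(self, num_slots=7, slot_dim=64, vocab_size=100):
        super().__init__()
        # Stand-in for a language-agnostic segmentation backbone
        # (e.g. Slot Attention): image -> per-object slot vectors.
        self.slot_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, num_slots * slot_dim)
        )
        # Stand-in decoder: slots -> reconstructed image.
        self.decoder = nn.Linear(num_slots * slot_dim, 3 * 64 * 64)
        # One embedding per concept word (category, property, relation).
        self.concept_embed = nn.Embedding(vocab_size, slot_dim)
        self.num_slots, self.slot_dim = num_slots, slot_dim

    def forward(self, image, concept_ids):
        # image: (B, 3, 64, 64); concept_ids: (B,) word index per example.
        slots = self.slot_encoder(image).view(-1, self.num_slots, self.slot_dim)
        recon = self.decoder(slots.flatten(1)).view_as(image)
        recon_loss = F.mse_loss(recon, image)

        # Grounding: score each slot against the concept word and softly
        # attend to the best-matching slot (the word's likely referent).
        word = self.concept_embed(concept_ids)              # (B, D)
        scores = torch.einsum('bsd,bd->bs', slots, word)    # (B, S)
        attn = scores.softmax(dim=-1)
        attended = torch.einsum('bs,bsd->bd', attn, slots)  # (B, D)
        concept_loss = 1 - F.cosine_similarity(attended, word).mean()

        return recon_loss + concept_loss
```

In this reading, the grounding term is what lets language shape the representation: gradients from matching a word to its referent flow back into the same encoder that the reconstruction loss trains, which is consistent with LORL wrapping any language-agnostic segmentation backbone.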
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We present a framework for learning disentangled, object-centric scene representations from vision and language.
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:2012.15814/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=V_bF6lx5M