Keywords: representation learning, compositional generalization, generative models
Abstract: Visual perception in the human brain is often thought to arise from inverting an internal generative model. In contrast, today’s most successful machine vision models are non-generative, relying on an *encoder* and not a generative *decoder*. This raises the question of whether generation is required for machines to achieve human-level visual perception. In this work, we address this question from the perspective of data efficiency, a core feature of human perception. Specifically, we investigate whether *compositional generalization* to out-of-domain (OOD) images is achievable, both in theory and practice, using generative and non-generative methods. We first formalize the inductive biases required to guarantee compositional generalization in generative (decoder-based) and non-generative (encoder-based) methods. We then provide theoretical results suggesting that such inductive biases cannot be enforced on an encoder through practical means such as regularization or architectural constraints, and thus compositional generalization cannot be guaranteed. In contrast, enforcing the inductive biases on a decoder is straightforward, enabling compositional generalization through inverting the decoder. We highlight that this inversion can be performed efficiently for OOD images, either online through gradient-based search or offline through generative replay. Empirically, we train a variety of non-generative methods on image datasets with concepts such as animals and backgrounds, and find that they tend to fail to generalize compositionally in a data-efficient manner. By comparison, generative methods, which leverage search and replay, yield significant gains in OOD performance.
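The abstract's notion of online decoder inversion can be made concrete with a minimal sketch (not the authors' code): given a trained generative decoder, a latent code for an out-of-domain image is found by gradient-based search over the latent space. The decoder `decoder`, the target image tensor `x_ood`, and all hyperparameters below are hypothetical placeholders, assuming a standard PyTorch setup.

```python
import torch

def invert_decoder(decoder, x_ood, latent_dim, steps=500, lr=0.05):
    """Search for a latent z whose decoding matches the OOD image (sketch only)."""
    z = torch.zeros(1, latent_dim, requires_grad=True)    # initial latent guess
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        recon = decoder(z)                                 # generate an image from z
        loss = torch.nn.functional.mse_loss(recon, x_ood)  # reconstruction error
        loss.backward()                                    # gradients w.r.t. z only
        optimizer.step()
    return z.detach()                                      # inferred representation
```

In this sketch the decoder's weights stay fixed and only the latent code is optimized, which is what allows the procedure to be run online for individual OOD inputs.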
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20239