Keywords: representation learning, compositional generalization, generative models
Abstract: Visual perception in the human brain is often thought to result from inverting a generative *decoder* that maps latents to images. In contrast, today’s most successful vision models are non-generative, relying on an *encoder* that maps images to latents without inverting an image decoder. This raises the question of whether generation is required for machines to achieve human-level visual perception. In this work, we approach this question from the perspective of data efficiency, a core feature of human perception. Specifically, we investigate whether *compositional generalization* is achievable, both in theory and practice, using generative and non-generative methods. We first formalize the inductive biases required to guarantee compositional generalization in generative (decoder-based) and non-generative (encoder-based) methods. We then provide theoretical results suggesting that such inductive biases cannot be enforced on an encoder through practical means such as regularization or architectural constraints. In contrast, we show that enforcing the inductive biases on a decoder is straightforward, enabling compositional generalization through inverting the decoder. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay.
Empirically, we train a range of non-generative methods on photorealistic image datasets and find that they often fail to generalize compositionally, requiring large-scale pretraining for generalization to improve. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.
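As a minimal sketch of the two inversion strategies named in the abstract (not the authors' implementation: the decoder architecture, latent size, loss, and hyperparameters are illustrative assumptions), the following PyTorch snippet inverts a frozen decoder by gradient-based search over latents, and shows how generative replay would synthesize (image, latent) pairs offline:

```python
import torch
import torch.nn as nn

# Hypothetical decoder mapping latents to flattened images. In practice this
# would be a trained generative decoder; random weights keep the sketch
# self-contained.
decoder = nn.Sequential(
    nn.Linear(16, 256), nn.ReLU(),
    nn.Linear(256, 3 * 32 * 32),
)
decoder.requires_grad_(False)  # the decoder stays frozen during inversion

def invert(image: torch.Tensor, steps: int = 500, lr: float = 0.05) -> torch.Tensor:
    """Online inversion: search latent space by gradient descent so that the
    decoded latent matches the observed image."""
    z = torch.zeros(1, 16, requires_grad=True)  # initial latent guess
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((decoder(z) - image) ** 2).mean()  # pixel reconstruction error
        loss.backward()
        opt.step()
    return z.detach()

def replay_batch(n: int = 64):
    """Offline generative replay: sample latents, decode them, and return the
    synthesized (image, latent) pairs, e.g. to train an amortized encoder."""
    z = torch.randn(n, 16)
    with torch.no_grad():
        images = decoder(z)
    return images, z

# Usage: infer a latent for an observed image without a trained encoder.
img = torch.randn(1, 3 * 32 * 32)  # stand-in for a flattened observed image
z_hat = invert(img)
```

Under this reading, perception is amortized either at inference time (search) or at training time (replay), while the inductive biases live entirely in the decoder.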
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20239