TL;DR: Standard generative networks, even when given separate x/y cues, memorize and stitch together training examples rather than compose the cues. Forcing each cue to be rendered directly in the output (pixel) space, via low-rank embeddings or simple stripe data, enables true compositional generalization.
Abstract: Composition, the ability to generate myriad variations from finite means, is believed to underlie powerful generalization. However, compositional generalization remains a key challenge for deep learning. A widely held assumption is that learning disentangled (factorized) representations naturally supports this kind of extrapolation. Yet empirical results are mixed, with many generative models failing to recognize and compose factors to generate out-of-distribution (OOD) samples. In this work, we investigate a controlled 2D Gaussian "bump" generation task with fully disentangled $(x,y)$ inputs, and demonstrate that standard generative architectures still fail in OOD regions when trained on partial data, because they re-entangle the latent representations in subsequent layers. By examining the model's learned kernels and manifold geometry, we show that this failure reflects a "memorization" strategy for generation, via superposition of training data rather than composition of the true factorized features. We show that when models are forced, through architectural modifications with regularization or through curated training data, to render the disentangled latents into the full-dimensional representational (pixel) space, they become highly data-efficient and effective at composing in OOD regions. These findings underscore that disentangled latents in an abstract representation are insufficient, and show that models which represent the disentangled factors directly in the output representational space can achieve robust compositional generalization.
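To make the setup concrete, here is a minimal Python sketch of the bump-generation task with a held-out central OOD region. The grid size, Gaussian width, and hole location are illustrative choices of ours, not necessarily those used in the paper or the released code:

```python
import numpy as np

def gaussian_bump(x0, y0, grid=32, sigma=1.0):
    """Render a 2D Gaussian bump centered at (x0, y0) on a grid x grid image."""
    xs = np.arange(grid)
    gx = np.exp(-((xs - x0) ** 2) / (2 * sigma ** 2))  # 1D Gaussian along x
    gy = np.exp(-((xs - y0) ** 2) / (2 * sigma ** 2))  # 1D Gaussian along y
    return np.outer(gy, gx)  # separable product: rows index y, columns index x

def make_dataset(grid=32, sigma=1.0, hole=(12, 20)):
    """Build (one-hot x, one-hot y) -> image pairs, skipping the central
    square hole[0] <= x0, y0 < hole[1], which is seen only at test time."""
    inputs, targets = [], []
    for x0 in range(grid):
        for y0 in range(grid):
            if hole[0] <= x0 < hole[1] and hole[0] <= y0 < hole[1]:
                continue  # held-out compositional (OOD) region
            one_hot = np.concatenate([np.eye(grid)[x0], np.eye(grid)[y0]])
            inputs.append(one_hot)
            targets.append(gaussian_bump(x0, y0, grid, sigma))
    return np.array(inputs), np.array(targets)
```

The inputs are fully disentangled by construction; the question the paper studies is whether the model composes them in the unseen central square or merely interpolates remembered images.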
Lay Summary: Humans effortlessly mix a few simple pieces, like words or shapes, to create endless new ideas, but computers usually must see every example to learn. To investigate, we asked an AI to draw a single shaded dot anywhere it was told on a blank grid, except we hid the central area during training. Even when we gave it the exact "x" and "y" instructions, the AI simply stitched together bits of remembered examples instead of learning the underlying rule for placing the dot. Then we tried two small tweaks: one that makes the AI paint each instruction directly onto the final grid, and another that first teaches it simple horizontal and vertical lines. With either tweak, the AI truly learned to combine the two directions and instantly filled in the missing center, using far fewer examples. This shows that grounding each piece of information right where the AI acts can help future systems flexibly recombine known elements, whether for new word combinations, object layouts, or routes, without needing to relearn every possibility.
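The "simple horizontal and vertical lines" tweak can be read as training on single-factor stripe images, each cue rendered alone in pixel space. The sketch below is one illustrative rendering under that reading (function name and parameters are ours, not from the paper):

```python
import numpy as np

def stripe_images(x0, y0, grid=32, sigma=1.0):
    """Auxiliary targets: a vertical Gaussian stripe at column x0 and a
    horizontal Gaussian stripe at row y0, one per input cue."""
    xs = np.arange(grid)
    gx = np.exp(-((xs - x0) ** 2) / (2 * sigma ** 2))
    gy = np.exp(-((xs - y0) ** 2) / (2 * sigma ** 2))
    vertical = np.tile(gx, (grid, 1))             # every row equals gx
    horizontal = np.tile(gy[:, None], (1, grid))  # every column equals gy
    return vertical, horizontal
```

Note that the 2D bump is exactly the elementwise product of the two stripes (`vertical * horizontal`), so this curriculum makes the compositional structure of the task explicit in the output space.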
Link To Code: https://github.com/qiyaoliang/DisentangledCompGen
Primary Area: Deep Learning->Everything Else
Keywords: Factorization, Compositionality, Compositional Generalization, Data Efficiency
Submission Number: 16212