TL;DR: We study length generalization through the lens of Low-Dimension-to-High-Dimension generalization.
Abstract: Low-Dimension-to-High-Dimension (LDHD) generalization, a special case of Out-of-Distribution (OOD) generalization, involves training on a low-dimensional subspace and testing in a higher-dimensional space. Assuming instances are generated from latent variables that reflect problem scale, LDHD generalization captures the inherent scaling challenge of length generalization. We theoretically show that LDHD generalization is unattainable without an appropriate inductive bias. Focusing on Boolean functions, we demonstrate that different architectures trained with (S)GD converge to *min-degree interpolators w.r.t. different linearly independent sets*, and achieve LDHD generalization only when the target function aligns with this bias. Viewing length generalization through the lens of LDHD generalization, we explain the success of Chain-of-Thought (CoT) as restructuring the latent space for improved LDHD generalization. We further propose a principle for position embedding design: handle LDHD generalization and data-format nuisances separately. Following this principle, we introduce RPE-Square, a novel embedding that enhances RPE to better handle data formats.
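To make the core setup concrete, here is a minimal illustrative sketch (our own construction, not code from the paper): two parity functions agree on every training input drawn from a low-dimensional subspace (trailing coordinates fixed to 0) yet disagree off that subspace, so the training data alone cannot select between them, and LDHD generalization hinges on the learner's inductive bias. The names `pad`, `f_parity_k`, and `f_parity_n`, and the choice of padding value, are illustrative assumptions.

```python
# Minimal sketch of the LDHD setup for Boolean functions (assumptions, not
# the paper's code): train only on inputs whose trailing coordinates are
# fixed to 0 (a low-dimensional subspace), then test on the full space.
import itertools

n, k = 4, 2  # ambient dimension n; training inputs vary only in first k coords

def pad(prefix, n):
    """Embed a k-bit instance into n dims by fixing trailing coords to 0."""
    return prefix + (0,) * (n - len(prefix))

# Two candidate interpolators of the same training data:
f_parity_k = lambda x: x[0] ^ x[1]                 # low-degree in the first k coords
f_parity_n = lambda x: x[0] ^ x[1] ^ x[2] ^ x[3]   # higher-degree alternative

# Both agree on every training point (the low-dimensional subspace) ...
train = [pad(p, n) for p in itertools.product((0, 1), repeat=k)]
assert all(f_parity_k(x) == f_parity_n(x) for x in train)

# ... but disagree off the subspace, so the data alone cannot decide
# between them: LDHD generalization requires an inductive bias.
test = [x for x in itertools.product((0, 1), repeat=n) if x not in train]
print(sum(f_parity_k(x) != f_parity_n(x) for x in test),
      "disagreements on", len(test), "test points")
```

Under a min-degree bias of the kind the abstract describes, a learner would select `f_parity_k`, the lower-degree interpolator, and thus generalize to higher dimensions exactly when the target function is itself low-degree with respect to that linearly independent set.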
Lay Summary: Why is it hard for machines to learn to solve large and complex problems by first practicing on smaller and simpler ones? In this paper, we show that there are two main reasons: First, problems often become fundamentally more complex as their size grows. Second, the way these problems are presented to the machine can make learning even harder.
We prove that, in general, no single learning method can guarantee success in scaling up from small to large problems across all tasks. However, when we do have some prior knowledge about a particular problem, we can design machines that are better suited to learn this way.
Our work helps researchers better understand the challenges machines face when generalizing from small to large problem instances. It also offers practical guidance on how to adapt machine designs when additional information about the task is available.
Primary Area: Theory->Deep Learning
Keywords: Length Generalization
Submission Number: 5857