Task Generalization with Autoregressive Compositional Structure: Can Learning from $D$ Tasks Generalize to $D^T$ Tasks?
Abstract: Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of autoregressive compositional structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $D$ subtasks. This yields a task class of total size $D^T$. We first show that generalization to all $D^T$ tasks is theoretically achievable by training on only $\tilde{O}(D)$ tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via in-context learning (ICL) and chain-of-thought (CoT) reasoning. We further demonstrate this exponential generalization in arithmetic and language translation, extending beyond parity functions.
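As a concrete, hypothetical illustration of the autoregressive compositional structure above, the Python sketch below (our own toy construction; the subtask family, the 5-bit inputs, and all names are illustrative assumptions, not the paper's exact setup) composes $T$ operations drawn from a family of $D$ subtasks, enumerates the resulting $D^T$ tasks, and records the chain of intermediate states in the spirit of a chain-of-thought trace.

    # Illustrative sketch of an autoregressive compositional task family
    # (toy construction for this page, not the paper's exact setup).
    from itertools import product

    D = 4   # number of distinct subtasks (operations)
    T = 3   # number of composition steps per task

    def make_subtask(i, j):
        # Toy subtask: XOR two fixed input bits into the running state.
        return lambda x, state: state ^ x[i] ^ x[j]

    subtasks = [make_subtask(i, (i + 1) % 5) for i in range(D)]

    # A task is a length-T sequence of subtask indices: D**T tasks in total.
    all_tasks = list(product(range(D), repeat=T))
    assert len(all_tasks) == D ** T

    def run_task(task, x):
        """Evaluate a task step by step, keeping every intermediate state
        (the analogue of a chain-of-thought trace)."""
        state, trace = 0, []
        for k in task:
            state = subtasks[k](x, state)
            trace.append(state)
        return state, trace

    answer, chain = run_task(all_tasks[0], x=[1, 0, 1, 1, 0])
    print(answer, chain)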
Lay Summary: Large language models (LLMs) exhibit a striking ability to solve tasks they were never explicitly trained on. Unlike classical supervised learning—which typically assumes that test data comes from the same distribution as training data—LLMs can generalize to entirely new task distributions given only a few demonstrations. This phenomenon is known as in-context learning (ICL).
We study this capability of LLMs through the lens of autoregressive compositionality, focusing on tasks that can be decomposed into $T$ operations, each selected from only $D$ possibilities. Although the total number of tasks grows exponentially as $D^T$, we prove mathematically that a model trained on $D\cdot\ln(D)$ randomly chosen tasks can generalize to the entire task family.
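One heuristic way to see why on the order of $D\ln D$ training tasks can be enough (our own coupon-collector reading of the statement, not necessarily the paper's proof technique) is that roughly that many uniformly random tasks already expose every one of the $D$ subtasks at every one of the $T$ positions with high probability. The small simulation below, written for this summary with arbitrary constants, checks that coverage numerically.

    # Coupon-collector-style check: how often do m random tasks expose
    # every one of the D subtasks at every one of the T positions?
    # Hypothetical illustration; all constants are arbitrary.
    import math
    import random

    D, T, trials = 16, 5, 2000
    m = 3 * int(D * math.log(D))   # a small constant times D*ln(D); still far below D**T

    def covers_all_positions():
        tasks = [[random.randrange(D) for _ in range(T)] for _ in range(m)]
        # For each position t, check that all D subtasks showed up there.
        return all(len({task[t] for task in tasks}) == D for t in range(T))

    coverage = sum(covers_all_positions() for _ in range(trials)) / trials
    print(f"{m} random tasks out of {D ** T}: full coverage in {coverage:.1%} of trials")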
Our experiments with Transformer models support this theory: when prompted to produce a chain of thought before answering, a model trained on a small set of tasks correctly handles many new ones. We demonstrate this across parity problems, arithmetic, and multi-step language translation.
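To make the parity experiments concrete, here is a hypothetical sketch of how an in-context-learning prompt with chain-of-thought steps might be constructed for a sparse parity task; the prompt format, the secret index set, and the helper names are illustrative assumptions, not the paper's exact encoding. Each demonstration spells out the running parity after each secret coordinate, so a model can imitate the step-by-step computation before emitting the final answer.

    # Hypothetical prompt builder for sparse-parity ICL with chain of thought.
    # A task is identified by its set of secret indices; the label of x is the
    # XOR of the bits of x at those indices.
    import random

    def cot_example(x, secret):
        # Chain-of-thought trace: running parity after each secret index.
        partial, steps = 0, []
        for i in secret:
            partial ^= x[i]
            steps.append(str(partial))
        return f"input={''.join(map(str, x))} steps={'>'.join(steps)} answer={partial}"

    def build_prompt(secret, n_bits=8, n_demos=4):
        # A few solved demonstrations followed by one unanswered query.
        lines = []
        for _ in range(n_demos):
            x = [random.randint(0, 1) for _ in range(n_bits)]
            lines.append(cot_example(x, secret))
        query = [random.randint(0, 1) for _ in range(n_bits)]
        lines.append(f"input={''.join(map(str, query))} steps=")
        return "\n".join(lines)

    print(build_prompt(secret=(1, 4, 6)))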
Primary Area: Deep Learning->Theory
Keywords: In-Context Learning, Chain-of-Thought, Parity Problem, Compositional Generalization
Submission Number: 9191