From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, Readers: Everyone, License: CC BY 4.0
Abstract: Reinforcement learning (RL) has emerged as a key mechanism for transforming LLMs into robust reasoners. While supervised fine-tuning (SFT) often limits models to the distribution of observed reasoning traces, RL post-training significantly improves performance on out-of-distribution (OOD) tasks that require unfamiliar recombinations of familiar steps. We argue that this improvement is driven by **compositional generalization**, which we formalize through a **Hierarchical Latent Selection Model**. In this framework, reasoning traces are generated by a cascade of discrete latent selection variables corresponding to reusable atomic modules, including both skills (local operations) and routing mechanisms (how intermediate information is selected, reused, and composed). We theoretically show that RL’s exploratory nature provides sufficient coverage to identify latent structure and enable compositional generalization. We design controlled experiments to validate this theory. Our results demonstrate that RL can extract atomic modules from compound traces and recombine them to solve new configurations. Moreover, we find that training on compound traces can yield stronger generalization than training on isolated atomic modules. Finally, we investigate relations between SFT and RL and identify an effective protocol in which SFT ensures coverage of all atomic modules, while RL focuses on novel compositions beyond the SFT support to encourage exploration.
Lay Summary: Modern AI language models can follow worked examples, but they may fail when a new problem requires familiar steps in a new order. This paper asks why trial-and-feedback training improves that ability. We view reasoning as built from reusable pieces: simple problem-solving actions and choices about which earlier result to use next. Our theory says feedback can help a model separate these pieces from longer examples and reuse them in new combinations. In simple controlled tasks where the right pieces are known, we find that feedback-based training helps models solve new multi-step problems. Practicing whole multi-step examples works better than practicing single pieces alone. The results suggest a practical recipe: first show examples covering all needed pieces, then give feedback on new combinations.
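To make the idea of reusable pieces concrete, here is a minimal toy sketch in Python. It is not the paper's model or experimental setup. The skill names, the integer task, the `run_trace` helper, and the held-out split are all hypothetical and chosen only for illustration. It shows a trace built from skills (local operations) and routing choices (which earlier result each step reads), plus a split in which every skill appears in training while some compositions are held out.

```python
from itertools import product

# Hypothetical atomic skills: local operations on a single integer.
SKILLS = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "neg": lambda x: -x,
}


def run_trace(x0, steps):
    """Execute a compound trace and return all intermediate results.

    Each step is (skill_name, route). `route` is the index of the earlier
    result the skill reads from; index 0 is the original input.
    """
    results = [x0]
    for skill_name, route in steps:
        results.append(SKILLS[skill_name](results[route]))
    return results


# All two-step skill pairs (illustrative task space).
all_pairs = list(product(SKILLS, repeat=2))

# Hypothetical held-out compositions: never seen during training.
held_out = {("dbl", "neg"), ("neg", "inc")}
train_pairs = [p for p in all_pairs if p not in held_out]

# Coverage check: every atomic skill still appears somewhere in training,
# even though some of its combinations do not.
train_skills = {name for pair in train_pairs for name in pair}

if __name__ == "__main__":
    # Same skills, different routing: the second step reads step 1 or the input.
    print(run_trace(3, [("inc", 0), ("dbl", 1)]))
    print(run_trace(3, [("inc", 0), ("dbl", 0)]))
    print(sorted(train_skills), len(train_pairs), sorted(held_out))
```

In this toy setting, a held-out pair such as `("dbl", "neg")` plays the role of an unfamiliar recombination of familiar steps: both skills occur in the training pairs, but never in that order.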
Primary Area: Deep Learning->Large Language Models
Keywords: reasoning, reinforcement learning, compositional generalization
Submission Number: 19508