Keywords: compositional generalization, large language models
Abstract: While RL successfully enhances reasoning in LLMs, its effect on compositional generalization (the ability to synthesize novel skills from known components) is often conflated with length generalization. Motivated by this gap, we study what RL teaches models about skill composition and how the structure of a composition affects skill transfer. We focus on Countdown (given $n$ numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and can thus be viewed as a ``skill.'' Tracking success rates per tree shape over training, we find: (i) OOD generalization to larger $n$ and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability: models master shallow balanced trees (where the workload is split evenly between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even at the same composition depth as left-heavy structures the models handle). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k capture. Our insights appear applicable to any domain with hierarchical reasoning over structured subtasks, such as program synthesis. We speculate that our structure-aware analysis of task difficulty can yield a more efficient data curriculum than simple test-time scaling alone.
Submission Number: 249
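For concreteness, the per-tree-shape diagnostic described in the abstract could be implemented roughly as follows. This is a minimal illustrative sketch, not the paper's code: it assumes model solutions arrive as plain arithmetic strings, and the helper names (`shape`, `depth`, `classify`, `record`) are hypothetical.

```python
# Sketch (assumptions, not the authors' implementation): parse a Countdown
# solution into its expression tree, extract a shape signature, and bucket
# success rates per shape over training.
import ast
from collections import defaultdict

def shape(node):
    """Nested-tuple signature of the expression tree; every leaf maps to ()."""
    if isinstance(node, ast.BinOp):
        return (shape(node.left), shape(node.right))
    return ()

def depth(s):
    """Tree depth from a shape signature; a leaf () has depth 0."""
    return 1 + max(map(depth, s)) if s else 0

def classify(expr):
    """Shape and depth of an arithmetic expression, e.g. '(3 + 5) * (7 - 2)'."""
    s = shape(ast.parse(expr, mode="eval").body)
    return s, depth(s)

# Per-shape success statistics: shape -> [successes, attempts].
stats = defaultdict(lambda: [0, 0])

def record(expr, target):
    """Score one model solution against its target and tally it by shape."""
    s, _ = classify(expr)
    stats[s][1] += 1
    if eval(expr) == target:  # toy check; a real pipeline should sandbox this
        stats[s][0] += 1

record("(3 + 5) * (7 - 2)", 40)   # balanced tree, depth 2
record("3 + (5 * (7 - 2))", 28)   # right-heavy chain, depth 3
print({s: ok / n for s, (ok, n) in stats.items()})
```

The nested-tuple signature deliberately discards operators and operands, so two solutions count as the same "skill structure" whenever their composition trees coincide, which is what lets success rates for balanced versus left- or right-heavy shapes be compared directly.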