How does RL Post-training Induce Skill Composition? A Case Study on Countdown

Published: 27 May 2026, Last Modified: 27 May 2026, Venue: CompLearn 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: compositional generalization, RL post-training
Abstract: While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. We therefore study what RL post-training teaches models about skill composition and how the composition structure affects skill transfer. We focus on the COUNTDOWN task (given $n$ numbers and a target, form an expression that evaluates to the target) and analyze model solutions using expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a ``skill.'' Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) the order of skill acquisition depends on the composition structure: models master shallow balanced trees (where the workload is balanced between subtasks) before deep unbalanced ones, and remain persistently fragile on right-heavy structures at a fixed composition depth. Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal. To explore the implications of our findings for more realistic tasks, we present a preliminary study on formal theorem proving.
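The expression-tree view described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the `Node` class, the helper names, the restriction to the four arithmetic operators, and the example numbers and target are all assumptions made for illustration. It shows two solutions to the same hypothetical COUNTDOWN instance (numbers 1, 2, 3, 4 and target 24) that differ only in tree shape, one shallow and balanced, one deep and right-heavy.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Union


@dataclass(frozen=True)
class Node:
    """Internal node of an expression tree (hypothetical representation)."""
    op: str
    left: "Tree"
    right: "Tree"


Tree = Union[int, Node]  # a leaf is one of the given numbers

# Assumed operator set; the abstract does not list the allowed operations.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate(t: Tree) -> Fraction:
    """Evaluate the tree exactly, using rationals to avoid float error."""
    if isinstance(t, int):
        return Fraction(t)
    return OPS[t.op](evaluate(t.left), evaluate(t.right))


def shape(t: Tree) -> str:
    """Tree shape with numbers and operators erased; 'x' marks a leaf."""
    if isinstance(t, int):
        return "x"
    return f"({shape(t.left)} {shape(t.right)})"


def depth(t: Tree) -> int:
    """Composition depth: number of operator levels above the deepest leaf."""
    if isinstance(t, int):
        return 0
    return 1 + max(depth(t.left), depth(t.right))


def leaves(t: Tree) -> list:
    """Numbers used by the expression, in left-to-right order."""
    if isinstance(t, int):
        return [t]
    return leaves(t.left) + leaves(t.right)


# (1 + 3) * (2 + 4): shallow and balanced, two subtasks of equal size.
balanced = Node("*", Node("+", 1, 3), Node("+", 2, 4))
# 4 * (3 * (2 * 1)): deep and right-heavy, each subtask nested in the next.
right_deep = Node("*", 4, Node("*", 3, Node("*", 2, 1)))

for name, tree in [("balanced", balanced), ("right_deep", right_deep)]:
    print(name, shape(tree), "depth", depth(tree), "value", evaluate(tree))
```

Grouping model solutions by `shape` and tracking the success rate of each group over training is one way to obtain the kind of per-shape diagnostic the abstract describes; every subtree of a solution is itself a smaller instance with its own shape, which is what makes it reusable as a "skill" in this view.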
Submission Number: 145