h1: Bootstrapping Models to Reason over Longer Horizons via Reinforcement Learning

ICLR 2026 Conference Submission22165 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: long-horizon training, reasoning, LLMs, post-training, reinforcement learning
TL;DR: We develop a method to improve the long-horizon reasoning capabilities of LLMs by scaling RL using only short-horizon data
Abstract: Large language models excel at short-horizon reasoning tasks, but performance degrades as the reasoning horizon grows. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We then train models on this data with outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade math problems (GSM8K) boosts accuracy on unseen, Olympiad-level benchmarks (AIME) by up to 2.65x. Importantly, our long-horizon gains over the baselines remain significantly larger even at high pass@k, showing that models can learn entirely new reasoning paths under RL. Theoretically, we show that curriculum-based RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, comparable to the gains from dense supervision, while providing strong training signal without additional annotations. h1 therefore offers an efficient path towards scaling RL for longer horizons using existing data.
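To make the data-composition and curriculum ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation: the names `Problem`, `compose_chain`, `outcome_reward`, and `HorizonCurriculum` are illustrative assumptions, and the policy rollout and RL update are omitted. It shows how short-horizon items could be chained into a multi-step dependency problem whose only supervision is the final answer, with the composed horizon growing as rollout accuracy improves.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import random

@dataclass
class Problem:
    """A short-horizon item; "{prev}" marks where the previous step's answer is injected."""
    template: str                      # e.g. "You have {prev} apples and buy 3 more. How many now?"
    solve: Callable[[float], float]    # ground-truth transition: previous answer -> new answer

def compose_chain(pool: List[Problem], horizon: int, seed_value: float) -> Tuple[str, float]:
    """Chain `horizon` random problems into one prompt; only the final answer is kept as the label."""
    steps = random.choices(pool, k=horizon)
    value = seed_value
    parts = []
    for i, p in enumerate(steps):
        parts.append(f"Step {i + 1}: " + p.template.format(prev=value))
        value = p.solve(value)
    return "\n".join(parts), value

def outcome_reward(predicted: float, target: float, tol: float = 1e-6) -> float:
    """Outcome-only reward: 1 if the final answer matches, 0 otherwise (no step-level credit)."""
    return float(abs(predicted - target) < tol)

class HorizonCurriculum:
    """Grow the composed horizon once recent rollout accuracy clears a threshold."""
    def __init__(self, start: int = 1, max_horizon: int = 16,
                 threshold: float = 0.7, window: int = 256):
        self.horizon, self.max_horizon = start, max_horizon
        self.threshold, self.window = threshold, window
        self.recent: List[float] = []

    def update(self, reward: float) -> None:
        self.recent.append(reward)
        if len(self.recent) >= self.window:
            if sum(self.recent) / len(self.recent) >= self.threshold and self.horizon < self.max_horizon:
                self.horizon += 1
            self.recent.clear()

# Usage sketch: sample a composed prompt at the current horizon, score a (hypothetical)
# model answer with the outcome reward, and feed the reward back to the curriculum.
pool = [
    Problem("You have {prev} marbles and find 4 more. How many marbles do you have?", lambda x: x + 4),
    Problem("You have {prev} cookies and give away half. How many are left?", lambda x: x / 2),
]
curriculum = HorizonCurriculum()
prompt, answer = compose_chain(pool, horizon=curriculum.horizon, seed_value=10)
curriculum.update(outcome_reward(predicted=answer, target=answer))  # stand-in for a real rollout
```

In this reading, the reward stays sparse (final answer only) while the curriculum, rather than dense annotations, controls difficulty; the actual construction and thresholds used in the paper may differ.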
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22165