Keywords: Language Models, Pre-training, Reasoning, Evaluation, Efficiency
TL;DR: We enable small proxy models to reliably predict large model reasoning performance using next-token prediction on reasoning traces with task-aligned weighting, dramatically reducing pre-training recipe search cost.
Abstract: Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize recipes before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit \textit{emergent} behavior that only appears reliably at larger model sizes, often exceeding 7B parameters. To address this, we introduce \textsc{rBridge}, showing that small proxies ($\leq$1B) can effectively predict large-model reasoning by aligning more closely with \textbf{(1)} the pre-training objective and \textbf{(2)} the target task. \textsc{rBridge} achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, \textsc{rBridge} \textbf{(i)} reduces dataset ranking costs by over 100$\times$ relative to the best baseline, \textbf{(ii)} achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and \textbf{(iii)} transfers predictive relationships across pre-training recipes at 1B to 7B scale. These findings indicate that \textsc{rBridge} offers a practical path for exploring reasoning-oriented pre-training at lower cost.
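The core idea of weighting negative log-likelihood by task alignment can be sketched as follows. This is an illustrative toy, not the paper's actual formulation: the function names, the per-token alignment weights, and the normalization scheme are all assumptions made for clarity.

```python
import math


def task_weighted_nll(token_logprobs, alignment_weights):
    """Aggregate a proxy model's next-token NLL, reweighted by task alignment.

    token_logprobs: per-token log p(token | context) under the small proxy model,
        scored on a reasoning trace (e.g. one produced by a frontier model).
    alignment_weights: hypothetical per-token scores in [0, 1], higher for tokens
        judged more relevant to the target reasoning task.
    Returns the weighted average NLL (lower = better fit to the trace).
    """
    assert len(token_logprobs) == len(alignment_weights)
    total_weight = sum(alignment_weights)
    if total_weight == 0.0:
        return 0.0  # no task-relevant tokens; nothing to score
    weighted_sum = -sum(w * lp for w, lp in zip(token_logprobs, alignment_weights))
    return weighted_sum / total_weight


# Toy usage: two tokens, the second deemed twice as task-relevant as the first.
logprobs = [math.log(0.5), math.log(0.25)]
score = task_weighted_nll(logprobs, [0.5, 1.0])
```

Down-weighting off-task tokens is what lets the metric track reasoning-relevant fit rather than generic language-modeling loss, which is the mechanism the abstract attributes to \textsc{rBridge}.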
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5590