When More Tokens Hurt: Saturation Effects in Test-Time Compute Scaling

ACL ARR 2026 January Submission 3913 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Token Budgets, LLM Reasoning
Abstract: Recent work has shown that large language models can improve mathematical reasoning performance by allocating additional tokens at test time. However, the relationship between model scale and optimal token budgets remains unexplored. We conduct a systematic study of test-time compute scaling across four model sizes (0.5B to 7B parameters) on the GSM8K mathematical reasoning benchmark, evaluating performance at seven token budgets from 32 to 2048 tokens. We report three key findings: (1) all models exhibit a performance cliff at the 128-to-256-token transition, consistent across scales, with accuracy gains ranging from 8% to 51%; (2) larger models saturate at lower token budgets while achieving higher accuracy: the 7B model peaks at 512 tokens (86.8%), whereas the 0.5B model continues improving through 1024 tokens (18.7%); and (3) models can perform worse with excessive token budgets, with the 1.5B model losing 2.4% accuracy when increasing from 512 to 1024 tokens. These findings suggest that optimal token-allocation strategies must account for model scale, and that practitioners should avoid over-allocating compute budgets at inference time.
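
The evaluation design described in the abstract (sweeping generation-length caps across models and measuring GSM8K accuracy at each budget) can be illustrated with a minimal sketch. The snippet below is illustrative only, not the authors' code: it assumes a Hugging Face transformers-style interface, and the model name, dataset slice, and answer-extraction heuristic are placeholder assumptions.

# Minimal sketch of a token-budget sweep on GSM8K-style problems.
# Assumes a Hugging Face transformers-style interface; the model name,
# dataset slice, and answer-extraction heuristic are illustrative
# assumptions, not the authors' actual evaluation code.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder; the paper spans 0.5B-7B
TOKEN_BUDGETS = [32, 64, 128, 256, 512, 1024, 2048]

def extract_answer(text):
    """Pull the last number from a generated solution (simplistic heuristic)."""
    numbers = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return numbers[-1] if numbers else None

def evaluate(model, tokenizer, problems, budget):
    """Accuracy of greedy decoding when capped at `budget` new tokens."""
    correct = 0
    for item in problems:
        prompt = item["question"] + "\nLet's think step by step."
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=budget, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        gold = item["answer"].split("####")[-1].strip().replace(",", "")
        if extract_answer(completion) == gold:
            correct += 1
    return correct / len(problems)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    mdl = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto",
                                               device_map="auto")
    test = load_dataset("openai/gsm8k", "main", split="test").select(range(100))
    for b in TOKEN_BUDGETS:
        acc = evaluate(mdl, tok, test, budget=b)
        print(f"budget={b:5d}  accuracy={acc:.3f}")

Under this setup, repeating the sweep per model size would surface both the 128-to-256-token cliff and the scale-dependent saturation point reported in the abstract.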
Paper Type: Short
Research Area: AI/LLM Agents
Research Area Keywords: Token Budgets, LLM Reasoning
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: Python
Submission Number: 3913