When More Tokens Hurt: Saturation Effects in Test-Time Compute Scaling

ACL ARR 2026 January Submission 3913 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Token Budgets, LLM Reasoning
Abstract: Recent work has shown that large language models can improve mathematical reasoning performance by allocating additional tokens at test time. However, the relationship between model scale and optimal token budgets remains unexplored. We conduct a systematic study of test-time compute scaling across four model sizes (0.5B to 7B parameters) on the GSM8K mathematical reasoning benchmark, evaluating performance at seven token budgets from 32 to 2048 tokens. We report three key findings: (1) all models exhibit a performance cliff at the 128-to-256-token transition, consistent across scales, with accuracy gains ranging from 8% to 51%; (2) larger models saturate at lower token budgets while achieving higher accuracy: the 7B model peaks at 512 tokens (86.8%), whereas the 0.5B model continues improving through 1024 tokens (18.7%); and (3) models can perform worse with excessive token budgets, with the 1.5B model losing 2.4% accuracy when increasing from 512 to 1024 tokens. These findings suggest that optimal token-allocation strategies must account for model scale, and that practitioners should avoid over-allocating compute budgets at inference time.
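
The evaluation design described in the abstract (sweeping generation-length caps across models and measuring GSM8K accuracy at each budget) can be illustrated with a minimal sketch. The snippet below is illustrative only, not the authors' code: it assumes a Hugging Face transformers-style interface, and the model name, dataset slice, and answer-extraction heuristic are placeholder assumptions.

# Minimal sketch of a token-budget sweep on GSM8K-style problems.
# Assumes a Hugging Face transformers-style interface; the model name,
# dataset slice, and answer-extraction heuristic are illustrative
# assumptions, not the authors' actual evaluation code.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder; the paper spans 0.5B-7B
TOKEN_BUDGETS = [32, 64, 128, 256, 512, 1024, 2048]

def extract_answer(text):
    """Pull the last number from a generated solution (simplistic heuristic)."""
    numbers = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return numbers[-1] if numbers else None

def evaluate(model, tokenizer, problems, budget):
    """Accuracy of greedy decoding when capped at `budget` new tokens."""
    correct = 0
    for item in problems:
        prompt = item["question"] + "\nLet's think step by step."
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=budget, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        completion = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        gold = item["answer"].split("####")[-1].strip().replace(",", "")
        if extract_answer(completion) == gold:
            correct += 1
    return correct / len(problems)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    mdl = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto",
                                               device_map="auto")
    test = load_dataset("openai/gsm8k", "main", split="test").select(range(100))
    for b in TOKEN_BUDGETS:
        acc = evaluate(mdl, tok, test, budget=b)
        print(f"budget={b:5d}  accuracy={acc:.3f}")

Under this setup, repeating the sweep per model size would surface both the 128-to-256-token cliff and the scale-dependent saturation point reported in the abstract.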
Paper Type: Short
Research Area: AI/LLM Agents
Research Area Keywords: Token Budgets, LLM Reasoning
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: Python
Submission Number: 3913