Keywords: Token Budgets, LLM Reasoning
Abstract: Recent work has shown that large language models can improve mathematical reasoning performance by allocating additional tokens at test time. However, the relationship between model scale and optimal token budgets remains unexplored. We conduct a systematic study of test-time compute scaling across four model sizes (0.5B to 7B parameters) on the GSM8K mathematical reasoning benchmark, evaluating performance at seven token budgets from 32 to 2048 tokens. We report three key findings: (1) all models exhibit a performance cliff, consistent across scales, at the 128-to-256-token transition, with accuracy gains ranging from 8% to 51%; (2) larger models saturate at lower token budgets while achieving higher accuracy: the 7B model peaks at 512 tokens (86.8%), while the 0.5B model continues improving through 1024 tokens (18.7%); and (3) models can perform worse with excessive token budgets, with the 1.5B model losing 2.4% accuracy when increasing from 512 to 1024 tokens. These findings suggest that optimal token allocation strategies must account for model scale, and that practitioners should avoid over-allocating compute budgets at inference time.
Paper Type: Short
Research Area: AI/LLM Agents
Research Area Keywords: Token Budgets, LLM Reasoning
Contribution Types: Approaches to low-compute settings (efficiency)
Languages Studied: Python
Submission Number: 3913