Decomposing Smooth Agentic Inference Scaling
Keywords: LLMs, Evals, Inference Scaling, Cybersecurity
TL;DR: We find that smooth inference scaling on cybersecurity and software engineering (SWE) tasks is explained by discrete improvements on individual tasks, a per-task view that better explains how agents improve over time and helps model their behaviour.
Abstract: Recent evaluations of frontier large language models (LLMs) reveal a new scaling law: aggregate success rates scale linearly with logarithmic increases in serial token budgets. Explaining this trend is essential for predicting LLM capabilities from evaluations. In this work, we show that inference scaling is better understood through performance on individual tasks, and apply this view to two agent evaluations. We show that serial scaling is distinct from parallel scaling because per-task inference curves have narrow variance in the log-tokens needed for success, which motivates a sigmoidal approximation with a random walk model. This per-task view reveals both limits and opportunities for predicting LLM capabilities from limited data and older models. New models unlock new tasks that are hard to predict, but Bayesian fits to $20$ runs censored at $1$M tokens already distinguish successive model generations, even though the full evaluation requires $5$ runs across $107$ tasks at a $50$M token budget each. Our results show that evaluators will need to continue scaling up serial inference compute to assess agent capabilities, whilst highlighting the opportunity to complement this scaling with methods that extract more from each evaluation result.
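To make the central claim concrete, the following toy sketch shows how steep per-task sigmoids in log-tokens can average into an aggregate success rate that grows roughly linearly per decade of token budget. It is an illustration under stated assumptions, not the paper's fitted model: the function names, the sigmoid width of 0.1, and the even spread of task midpoints between $10^4$ and $10^8$ tokens are all hypothetical choices; only the task count of 107 is taken from the abstract.

```python
import math

def task_success(log_tokens, midpoint, width):
    """Success probability for one task as a sigmoid in log10(token budget).

    Assumed functional form for illustration; `midpoint` is the log10 budget
    at which the task is solved half the time, `width` controls steepness.
    """
    return 1.0 / (1.0 + math.exp(-(log_tokens - midpoint) / width))

def aggregate_success(log_tokens, midpoints, width):
    """Mean success rate across tasks at a given log10 token budget."""
    total = sum(task_success(log_tokens, m, width) for m in midpoints)
    return total / len(midpoints)

# Hypothetical setup: 107 tasks whose midpoints are spread evenly
# over log10(tokens) in [4, 8]. The even spread is an assumption.
N_TASKS = 107
MIDPOINTS = [4.0 + 4.0 * i / (N_TASKS - 1) for i in range(N_TASKS)]
WIDTH = 0.1  # narrow: each task flips from failing to succeeding within ~1 decade

# Aggregate success at 10^5, 10^6, and 10^7 tokens.
BUDGETS = [5.0, 6.0, 7.0]
RATES = [aggregate_success(b, MIDPOINTS, WIDTH) for b in BUDGETS]

# Gain in aggregate success per tenfold increase in budget.
GAINS = [RATES[i + 1] - RATES[i] for i in range(len(RATES) - 1)]

if __name__ == "__main__":
    for b, r in zip(BUDGETS, RATES):
        print(f"log10(tokens)={b:.0f}  aggregate success={r:.3f}")
    print("gain per decade:", [round(g, 3) for g in GAINS])
```

In this toy setting each individual task behaves almost like a step function, yet the per-decade gains in the aggregate are nearly equal. The linearity here depends entirely on the assumed even spread of midpoints; a different distribution of task difficulties would give a different aggregate shape.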
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 201