Abstract: Recent research across mathematical problem solving, proof assistant programming, and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts.
In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that, on each problem, the failure rate should fall exponentially with the number of attempts.
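To make the calculation concrete (notation ours, assuming $k$ independent attempts and a per-problem single-attempt success probability $p$): the per-problem failure rate after $k$ attempts is $$(1-p)^k = e^{-k \log\frac{1}{1-p}},$$ which decays exponentially in $k$, whereas the aggregate trend reported empirically is $-\log \mathbb{E}_p\left[1-(1-p)^k\right] \propto k^{-b}$ for some exponent $b > 0$.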
We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge?
We then answer this question by demonstrating that per-problem exponential scaling is consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy-tailed, such that a small fraction of tasks with extremely low success probabilities collectively warps the aggregate success trend into a power law, even as each problem scales exponentially on its own. A minimal simulation of this mechanism appears below.
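The following sketch (not the paper's code; the Beta distribution and all parameter values are assumptions chosen for illustration) shows how a heavy left tail of single-attempt success probabilities turns per-problem exponential decay into an aggregate power law:

```python
# Minimal sketch: sample single-attempt success probabilities p from a
# Beta(a, b) with small a, whose density behaves like p^(a-1) near zero.
# Each problem's failure rate (1-p)^k decays exponentially in k, but the
# aggregate failure rate E[(1-p)^k] decays like k^(-a), a power law.
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.3, 3.0                      # left-tail exponent a sets the power law
p = rng.beta(a, b, size=200_000)     # single-attempt success probabilities
ks = np.logspace(0, 4, 20).astype(int)

agg_failure = np.array([np.mean((1.0 - p) ** k) for k in ks])
# -log(average success rate) ~ E[(1-p)^k] for large k, so a log-log fit of
# the aggregate failure rate against k recovers the power-law exponent.
slope = np.polyfit(np.log(ks[5:]), np.log(agg_failure[5:]), 1)[0]
print(f"fitted power-law exponent: {-slope:.3f} (tail exponent a = {a})")
```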
We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2$-$4$ orders of magnitude less inference compute.
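A hedged sketch of the forecasting idea (our simplification, with an off-the-shelf Beta fit as a stand-in tail estimator; the paper's actual estimator may differ): estimate each problem's single-attempt success probability from a modest attempt budget, fit the left tail of that distribution, and read the power-law exponent off the fitted tail rather than scaling up attempts and fitting the aggregate curve:

```python
# Hedged sketch: forecast the aggregate power-law exponent from cheap
# single-attempt success-rate estimates, instead of many attempts per task.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_a = 0.3                                        # true left-tail exponent
p_true = rng.beta(true_a, 3.0, size=2_000)          # per-problem success probs
n_attempts = 100                                    # small, fixed budget
successes = rng.binomial(n_attempts, p_true)
p_hat = (successes + 0.5) / (n_attempts + 1.0)      # smoothed pass@1 estimates

# Fit a Beta distribution to the estimated probabilities; its first shape
# parameter governs the density near zero and hence the aggregate exponent.
a_hat, b_hat, _, _ = stats.beta.fit(p_hat, floc=0.0, fscale=1.0)
print(f"forecast exponent: {a_hat:.3f} (true tail exponent: {true_a})")
```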
Overall, our work contributes to a better understanding of how neural language model performance improves with scaled inference compute and to the development of scaling-predictable evaluations of (multimodal) language models.
Lay Summary: Recent research has shown a curious pattern: when language AIs are given multiple tries at a set of tasks, their overall success improves according to a "power law", a predictable but not overly fast curve. This was puzzling because, for any single task, more tries should make success much more likely very quickly (exponentially). Our work resolves this by showing that while individual tasks do follow this rapid improvement, the overall power law emerges from how task difficulties are spread. Specifically, a small number of extremely hard tasks, where the AI has a tiny chance of success on any single attempt, collectively slows the average improvement down to a power law, even as each individual task still improves exponentially with more tries. Understanding this lets us explain why some AI models or tasks do not follow this power law (they lack enough super-hard problems) and, more importantly, lets us predict this scaling behavior far more efficiently, using much less computational power, simply by looking at the initial success rates, especially on the toughest challenges.
Link To Code: https://github.com/RylanSchaeffer/KoyejoLab-Large-How-Do-Language-Monkey-Power-Get-Their-Power
Primary Area: General Machine Learning->Evaluation
Keywords: scaling laws, inference compute, scaling inference compute, language models, evaluations, scaling-predictable evaluations
Submission Number: 12080