Keywords: Inference-time compute, Scaling, Mixture
TL;DR: We analyze the effect of mixing model solutions at test time on inference scaling laws, finding that mixed strategies produce more diverse solutions and improve inference scaling.
Abstract: Scaling inference-time compute has enabled significant improvements in models' mathematical problem-solving ability. However, most inference scaling strategies sample from only a single model. We extend and analyze inference scaling in the mixed-model setting, where samples from weak-but-inexpensive and strong-but-expensive models can be pooled at test time. We find that mixing samples over a distribution of problems can outperform the best pure, single-model strategy by over 5\% at the same compute budget. Further, model mixing extends the compute regimes in which inference scaling reliably improves performance. However, as part of our analysis, we prove that for a \textbf{fixed problem} $Q$, a pure strategy that samples only a single model is most efficient; moreover, the best model is the one with the largest \textit{compute normalized probability} of success on $Q$. This implies that the observed empirical gains from model mixing stem from an average improvement over the problem distribution rather than from improvement over the best pure strategy for any single problem. To better understand this result, we empirically analyze the distribution of compute normalized probabilities over problems for models of various sizes. Our analysis reveals that each model is best suited to efficiently solving a non-trivial subset of problems, further motivating the effectiveness of mixing solutions. Somewhat surprisingly, this remains true even for the hardest problems, where, for example, the smallest model is the most efficient on 25\% of the problem set.
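A minimal worked sketch of the pure-strategy claim, under assumed notation (the symbols $c_i$, $p_i(Q)$, and $B$ below are illustrative and not taken from the paper): suppose model $i$ has per-sample cost $c_i$ and independent per-sample success probability $p_i(Q)$ on problem $Q$. Spending the entire budget $B$ on model $i$ yields at least one correct sample with probability
\[
  P_i(B) \;=\; 1 - \bigl(1 - p_i(Q)\bigr)^{B/c_i},
  \qquad
  \arg\max_i P_i(B) \;=\; \arg\max_i \frac{-\ln\bigl(1 - p_i(Q)\bigr)}{c_i} \;\approx\; \arg\max_i \frac{p_i(Q)}{c_i}.
\]
Under these assumptions the maximizer is independent of $B$, so a single model is optimal at every budget for a fixed $Q$, consistent with the pure-strategy result above; the rate $-\ln\bigl(1 - p_i(Q)\bigr)/c_i$ is one plausible reading of the \textit{compute normalized probability}.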
Submission Number: 136