Keywords: test-time scaling; best-of-n; LLMs; inference; model ensemble
TL;DR: We present a training-free test-time scaling method for ensembles of LLMs.
Abstract: Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, best-of-$n$ traditionally relies on a single model to generate all responses.
We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$.
Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN sequentially routes generations one-by-one across models, based on scores computed using a reward model and an agreement signal on the predicted responses.
This online routing requires no additional training, maintains compute parity with single-model best-of-$n$, and works with any plug-in reward model.
Across three math benchmarks (MATH500, OlympiadBench, MinervaMath), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model, with gains of up to 5\% absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited sequentially at inference time to achieve better best-of-$n$ performance than any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
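The abstract describes sequential, score-based routing at a high level; the sketch below illustrates one way such a loop could look. It is not the paper's implementation: the `generate`, `reward`, and `agreement` interfaces, the additive score update, and the final selection rule are all illustrative assumptions.

```python
# Minimal sketch of routed online best-of-n under assumed interfaces.
# `model.generate`, `reward`, and `agreement` are hypothetical stand-ins
# for the paper's components, not its actual API.
from collections import defaultdict


def robon(prompt, models, n, reward, agreement):
    """Sequentially allocate n generations across `models`.

    At each step, the next sample is drawn from the model whose running
    score (reward plus agreement with the responses collected so far) is
    currently highest; the final answer is chosen by standard best-of-n
    selection over the pooled responses.
    """
    responses = defaultdict(list)          # model -> its responses so far
    pool = []                              # all responses so far
    scores = {m: 0.0 for m in models}      # routing score per model

    for _ in range(n):
        # Route the next generation to the highest-scoring model.
        m = max(models, key=lambda x: scores[x])
        r = m.generate(prompt)
        responses[m].append(r)
        pool.append(r)
        # Update the routing score from the reward model and an agreement
        # signal between this model's responses and the overall pool.
        scores[m] = reward(prompt, r) + agreement(responses[m], pool)

    # Standard best-of-n selection over everything that was generated.
    return max(pool, key=lambda r: reward(prompt, r))
```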
Submission Number: 120