RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs

Published: 23 Sept 2025, Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: test-time scaling; best-of-n; LLMs; inference; model ensemble
TL;DR: We present a training-free test-time scaling method for ensembles of LLMs
Abstract: Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, best-of-$n$ traditionally relies on a single model to generate all responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN routes generations one-by-one across models, based on scores computed from a reward model and an agreement signal over the predicted responses. This online routing requires no additional training, maintains compute parity, and works with any plug-in reward model. Across three math benchmarks (MATH500, OlympiadBench, MinervaMath), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model, with gains of up to 5\% absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited sequentially at inference time to achieve better best-of-$n$ performance than any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
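The abstract's routing idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's actual algorithm: the function names (`robon_sketch`, `reward_fn`, `answer_fn`), the warm-start of one generation per model, the specific mixing weight `alpha`, and the exact scoring rule combining reward and agreement are all hypothetical.

```python
from collections import Counter

def robon_sketch(models, prompt, n, reward_fn, answer_fn, alpha=0.5):
    """Hypothetical sketch of sequential multi-model best-of-n routing.

    models:    list of callables, model(prompt) -> response string
    reward_fn: (prompt, response) -> float, a plug-in reward model
    answer_fn: response -> hashable final answer (e.g., a parsed boxed answer)
    alpha:     assumed weight trading off reward score vs. agreement signal
    """
    responses = []                 # all generations routed so far
    scores = [0.0] * len(models)   # running routing score per model
    for t in range(n):
        if t < len(models):
            i = t  # warm start: try each model once (an assumption)
        else:
            # route the next generation to the highest-scoring model
            i = max(range(len(models)), key=scores.__getitem__)
        r = models[i](prompt)
        responses.append((i, r))
        # agreement signal: fraction of generations sharing this answer
        counts = Counter(answer_fn(resp) for _, resp in responses)
        agree = counts[answer_fn(r)] / len(responses)
        scores[i] = alpha * reward_fn(prompt, r) + (1 - alpha) * agree
    # final selection: best-of-n by reward over all routed generations
    return max(responses, key=lambda ir: reward_fn(prompt, ir[1]))[1]
```

As in the abstract, no training is involved and the total budget stays at $n$ generations regardless of how many models are in the suite; only the per-step routing decision changes.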
Submission Number: 120