MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

Soheil Zibakhsh Shabgahi; Mohammad Samragh; Kumari Nishu; Lauren Hannah; Arnav Kundu; Minsik Cho

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Test-time Scaling, Mixture-of-Experts, Inference, Training-Free, MoE

TL;DR: We introduce Roster of Experts (RoE), a training-free inference method that efficiently increases MoE performance by turning a single model into a dynamic ensemble.

Abstract: The generation quality of large language models (LLMs) is often improved by using inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, and we refer to the resulting method as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.
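To make the routing idea concrete, here is a minimal sketch of hyper-parallel scaling on a toy MoE layer. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the noise distribution or the aggregation rule, so the sketch assumes Gumbel noise scaled by a `temperature` parameter, top-k expert selection, and a plain average over sampled outputs. All names (`noisy_top_k`, `moe_output`, `roe_forward`, `EXPERTS`, `ROUTER_LOGITS`) are hypothetical, and the batching and KV-caching optimizations are not modeled.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_top_k(logits, k, temperature, rng):
    """Pick the k experts with the largest noise-perturbed router logits.

    Assumption: Gumbel noise scaled by `temperature`; temperature=0
    recovers ordinary deterministic top-k routing.
    """
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append(logit + temperature * gumbel)
    order = sorted(range(len(logits)), key=lambda i: noisy[i], reverse=True)
    return order[:k]


def moe_output(x, logits, experts, chosen):
    """Mix the chosen experts' outputs, weighted by softmax of their clean logits."""
    weights = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for w, i in zip(weights, chosen):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


def roe_forward(x, logits, experts, k=2, n_samples=4, temperature=0.5, seed=0):
    """Sample several expert rosters for one token and average their outputs.

    Assumption: simple averaging stands in for the aggregation step.
    Returns the aggregated output and the list of sampled rosters.
    """
    rng = random.Random(seed)
    total = [0.0] * len(x)
    rosters = []
    for _ in range(n_samples):
        chosen = noisy_top_k(logits, k, temperature, rng)
        rosters.append(chosen)
        y = moe_output(x, logits, experts, chosen)
        total = [t + yi for t, yi in zip(total, y)]
    return [t / n_samples for t in total], rosters


def make_expert(scale):
    """Toy expert: scales its input vector by a constant."""
    return lambda x: [scale * v for v in x]


# Hypothetical toy layer: four experts and fixed router logits for one token.
EXPERTS = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
ROUTER_LOGITS = [2.0, 1.0, 0.5, -1.0]

if __name__ == "__main__":
    output, rosters = roe_forward([1.0, -2.0], ROUTER_LOGITS, EXPERTS)
    print("sampled rosters:", rosters)
    print("aggregated output:", output)
```

In this sketch, setting `temperature=0.0` makes every sampled roster identical to standard top-k routing, so the aggregate equals the ordinary MoE output; raising it diversifies the rosters that contribute to each token's prediction.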

Supplementary Material: pdf

Primary Area: foundation or frontier models, including LLMs

Submission Number: 21620
