Rethinking LLM Ensembling from the Perspective of Mixture Models

ICLR 2026 Conference Submission 18925 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Ensembling, Mixture models, Token-level routing
TL;DR: We revisit LLM ensembling from the perspective of mixture models and observe that: (1) it can be implemented by invoking only one model, and (2) it can be interpreted as the simplest form of token-level routing.
Abstract: Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea extends naturally to large language models (LLMs), yielding improved performance but incurring substantial computational cost. The inefficiency stems from directly applying the conventional ensemble implementation to LLMs, which requires a separate forward pass through each model to explicitly compute the ensemble distribution. In this paper, we revisit this conventional assumption and find that ensembling in the context of LLMs is fundamentally different. Unlike conventional models, LLMs typically generate tokens by sampling from the output distribution rather than selecting the top prediction via argmax. This key distinction enables us to reinterpret LLM ensembling as a mixture model. Under this perspective, one can sample from the ensemble distribution by simply selecting a single model at random and sampling from its output, which avoids computing the full ensemble distribution explicitly. We refer to this approach as the **Mixture-model-like Ensemble** (ME). ME is mathematically equivalent to sampling from the ensemble distribution but **requires invoking only one model**, making it **1.78×–2.68×** faster than conventional ensembling. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://anonymous.4open.science/r/Mixture-model-like-Ensemble/.
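For intuition, below is a minimal sketch of what ME-style decoding could look like. This is an illustrative assumption, not the authors' released implementation: it assumes Hugging Face-style causal LMs that share the same tokenizer/vocabulary, defaults to uniform mixture weights, and the function name `me_sample` is ours.

```python
import random
import torch

def me_sample(models, tokenizer, prompt, weights=None,
              max_new_tokens=50, temperature=1.0):
    """Illustrative Mixture-model-like Ensemble (ME) decoding sketch.

    At each step, draw ONE model at random (per `weights`) and sample the
    next token from that model alone. Marginally, the sampled token follows
    the weighted-average (ensemble) distribution, yet only a single forward
    pass is needed per step instead of one per model.
    Assumes all models share the same tokenizer/vocabulary.
    """
    weights = weights or [1.0 / len(models)] * len(models)
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Pick a single model at random; no other model is invoked this step.
        model = random.choices(models, weights=weights, k=1)[0]
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample, not argmax
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

The equivalence the abstract claims holds per token: with model k chosen with probability w_k, the next token is distributed as the mixture sum_k w_k p_k(token | prefix), exactly the ensemble average. Under argmax decoding this equivalence would not hold, which is why the observation hinges on LLMs sampling from their output distributions.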
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18925