Keywords: Ensembling language models, Ensemble method, Generative, Instruction response
TL;DR: SpecEM is a plug-and-play LLM ensemble that boosts strong models and enables fast, segment-level collaboration. It outperforms prior methods across six tasks and five LLMs.
Abstract: Ensembles of generative large language models (LLMs) are a promising way to compensate for individual model limitations by integrating the strengths of different LLMs. Existing LLM ensemble methods, however, face limitations such as first-token delay and difficulty with long-range semantic collaboration between models. Moreover, they typically assume equal voting weights for all models during ensembling, ignoring performance differences between models on a given task. In this work, we propose SpecEM, a training-free, plug-and-play LLM ensemble framework that dynamically adjusts each model's contribution in real time based on task performance. Inspired by speculative decoding, SpecEM iteratively performs drafting and verification, allowing models to collaborate semantically at the segment level to produce an integrated output. Furthermore, we introduce an online feedback mechanism with multiplicative weight updates, where each model's voting weight is adjusted on the fly according to how often it "outperforms" the others during the verification stage, ensuring that stronger models exert greater influence on the ensemble during generation. Experimental results on five popular LLMs (ranging from 7B to 72B parameters) and six benchmark tasks, spanning instruction following, reasoning, commonsense, and general instruction response, demonstrate consistent performance improvements over state-of-the-art LLM ensemble methods.
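For illustration, below is a minimal Python sketch of one segment-level draft-and-verify round combined with a multiplicative weight update, as described in the abstract. All names here (specem_round, score_fn, eta) are hypothetical stand-ins inferred from the abstract, not the authors' implementation.

    import numpy as np

    def specem_round(models, weights, prompt, score_fn, eta=0.5):
        """One illustrative draft-and-verify round of a SpecEM-style ensemble.

        models   : list of callables, each mapping a prompt to a candidate next segment
        weights  : np.ndarray of per-model voting weights (normalized to sum to 1)
        score_fn : callable scoring a candidate segment given the prompt
                   (a stand-in for the paper's verification step)
        eta      : multiplicative-weights learning rate (assumed parameter)
        """
        # Drafting: every model proposes a candidate next segment.
        candidates = [m(prompt) for m in models]

        # Verification: score each candidate; votes are weighted by the
        # current per-model weights, so stronger models have more influence.
        scores = np.array([score_fn(prompt, c) for c in candidates])
        winner = int(np.argmax(weights * scores))

        # Online feedback: multiplicatively boost models whose drafts score
        # best this round, so their influence grows as generation proceeds.
        gains = (scores == scores.max()).astype(float)
        weights = weights * np.exp(eta * gains)
        weights = weights / weights.sum()

        return candidates[winner], weights

Running this round repeatedly, appending each winning segment to the prompt, yields the iterative segment-level collaboration the abstract describes; the exponential update is one standard multiplicative-weights rule, chosen here only as a plausible instance.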
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 14462