Keywords: Ensemble LLMs, Next-Segment Prediction, Generative Models, Open-Domain Instruction Response
TL;DR: We introduce SpecFuse, a novel ensemble framework that generates fused outputs by iteratively producing the next segment through collaboration among LLMs, allowing base LLMs to be seamlessly integrated without any training or adaptation.
Abstract: Ensembles of generative large language models (LLMs) can integrate the strengths of different LLMs to compensate for the limitations of individual models.
However, recent work focuses on training an additional fusion model to combine complete responses from multiple LLMs, and thus fails to tap into the models' collaborative potential for generating higher-quality responses.
Moreover, because the fusion model is trained on a specialized dataset, these methods struggle to generalize to open-domain queries from online users.
In this paper, we propose SpecFuse, a novel ensemble framework that outputs the fused result by iteratively producing the next segment through collaboration among LLMs.
This is achieved through cyclic execution of its inference and verification components.
In each round, the inference component invokes each base LLM in parallel to generate candidate segments, and the verification component calls these LLMs again to predict a ranking over the candidates.
The top-ranked segment is then broadcast to all LLMs, encouraging them to generate higher-quality segments in the next round.
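To make the round structure concrete, here is a minimal Python sketch of one SpecFuse round. The generation and scoring interfaces are hypothetical stand-ins: the abstract does not specify the actual prompting or ranking procedure, and the parallel calls are shown sequentially for clarity.

```python
from typing import Callable, List

def specfuse_round(
    models: List[Callable[[str], str]],          # each base LLM: context -> candidate segment
    scorers: List[Callable[[str, str], float]],  # each base LLM: (context, candidate) -> score
    context: str,
) -> str:
    # Inference component: every base LLM proposes a next segment
    # (in the paper these calls run in parallel).
    candidates = [generate(context) for generate in models]

    # Verification component: the same LLMs judge the candidates; here we
    # approximate "predict the ranking" by summing per-model scores.
    def ensemble_score(candidate: str) -> float:
        return sum(score(context, candidate) for score in scorers)

    # The top-ranked segment is the one broadcast to all models.
    return max(candidates, key=ensemble_score)

# Toy usage with stand-in models that emit fixed segments:
models = [lambda ctx: " alpha", lambda ctx: " beta"]
scorers = [lambda ctx, c: float(len(c))] * 2
context = "Q: example query"
for _ in range(3):  # each round's winner extends the shared context
    context += specfuse_round(models, scorers, context)
```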
This design also makes the base LLMs plug-and-play: they can be ensembled without any training or adaptation, avoiding the generalization limitations above.
Furthermore, to conserve computational resources, we propose a model exit mechanism that, within each query response, dynamically excludes models that performed poorly in previous rounds.
This effectively reduces the number of model calls while maintaining overall performance.
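A hedged sketch of one possible exit rule follows, assuming a model exits after its candidates rank near the bottom for several consecutive rounds; the `patience` and `bottom_k` thresholds are illustrative, not the paper's actual criterion.

```python
from typing import Dict, List

def update_active_models(
    active: List[str],
    round_ranks: Dict[str, int],  # model id -> rank this round (0 = best)
    streaks: Dict[str, int],      # model id -> consecutive bottom-rank count
    patience: int = 2,
    bottom_k: int = 2,
) -> List[str]:
    """Drop a model once it has ranked in the bottom `bottom_k`
    for `patience` consecutive rounds of the current query."""
    survivors = []
    for m in active:
        if round_ranks[m] >= len(active) - bottom_k:
            streaks[m] = streaks.get(m, 0) + 1  # another bottom finish
        else:
            streaks[m] = 0                      # good round resets the streak
        if streaks[m] < patience:
            survivors.append(m)
    return survivors
```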
We conduct extensive experiments ensembling five LLMs with different architectures across six benchmarks covering instruction-response, reasoning, commonsense, and instruction-following tasks. The results demonstrate that SpecFuse consistently enhances performance across all benchmarks, improving ROUGE-L by $+3.1$ on the Chinese and $+3.0$ on the English human-computer interaction benchmarks. Furthermore, the model exit mechanism reduces the average number of models invoked per round from $5$ to $2.4$, with only a slight reduction in performance.
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10114