Keywords: LLM fusion, Query-level fusion, Thought-level fusion, Model-level fusion
TL;DR: We introduce LLMFusionBench, a large-scale routing benchmark, and FusionFactory, a multi-level LLM fusion framework that leverages routing data and thought templates to outperform the best individual LLMs across diverse tasks.
Abstract: The rapid advancement of large language models (LLMs) has created a diverse landscape of models, each excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data and naturally raising the question of whether such logs can be fully leveraged to fuse LLMs' complementary capabilities. Although prior work has explored various strategies for integrating multiple LLMs, we argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs (e.g., the fine-tuning and inference stages). To this end, we introduce LLMFusionBench, a large-scale benchmark for LLM fusion that spans 14 tasks across five domains, with responses from 20 open-source LLMs (8B–671B) totaling 103M tokens. Building on LLMFusionBench, we propose FusionFactory, a systematic framework with three fusion levels: (1) query-level fusion via tailored LLM routers, (2) thought-level fusion leveraging retrieved abstract reasoning templates, and (3) model-level fusion via distillation from top-ranked responses. Experiments show that FusionFactory consistently outperforms the best individual LLM across all 14 benchmarks, with the optimal fusion configuration varying across benchmarks, highlighting the promise of multi-LLM log data as a practical foundation for fusing diverse LLM capabilities.
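To make the fusion levels concrete, here is a minimal toy sketch of how multi-LLM log data could feed the query-level and model-level strategies the abstract describes. All names (`LOGS`, `query_level_route`, `model_level_distill_set`, the model names, and the scoring scheme) are hypothetical illustrations, not the paper's actual implementation:

```python
# Toy multi-LLM log: each entry records a query, the model that
# answered it, and a scalar quality score for that response.
LOGS = [
    {"query": "solve 2+2",      "model": "llm-math-8b",  "score": 0.9},
    {"query": "solve 2+2",      "model": "llm-chat-70b", "score": 0.6},
    {"query": "write a haiku",  "model": "llm-chat-70b", "score": 0.8},
    {"query": "write a haiku",  "model": "llm-math-8b",  "score": 0.3},
]

def query_level_route(query):
    """Query-level fusion: pick the model whose logged responses to
    this query scored highest (a stand-in for a learned router)."""
    hits = [e for e in LOGS if e["query"] == query]
    return max(hits, key=lambda e: e["score"])["model"]

def model_level_distill_set(k=1):
    """Model-level fusion: keep the top-k responses per query to
    build a distillation training set for a single student model."""
    by_query = {}
    for e in LOGS:
        by_query.setdefault(e["query"], []).append(e)
    return {q: sorted(es, key=lambda e: -e["score"])[:k]
            for q, es in by_query.items()}

print(query_level_route("solve 2+2"))  # -> llm-math-8b
```

Thought-level fusion would sit between these two: instead of routing queries or distilling answers, it would retrieve abstract reasoning templates mined from high-scoring log entries and prepend them to the prompt at inference time.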
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20262