TL;DR: Training-Free Multimodal Large Language Model Orchestration
Abstract: Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without any gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.
Lay Summary: Interactive multimodal assistants are usually built by training a single model end to end to align and fuse different modalities, which consumes large amounts of data and compute and makes the system hard to extend. This work proposes a training-free framework in which a large language model orchestrates existing, mature modality modules as one system, with no additional training. The language model recognizes user intent and schedules tasks accordingly, the framework stores and reuses information efficiently, and the interaction stays stable and coherent. The method achieves competitive results across multiple evaluation scenarios at low operational cost and with good modular extensibility, offering a lightweight and practical way to build general-purpose multimodal platforms.
Link To Code: https://github.com/MAC-AutoML/Trainingfree-LLM-Orchestration
Primary Area: Applications->Language, Speech and Dialog
Keywords: Omni, LLM, Multimodal, Training-Free, Full-Duplex
Originally Submitted PDF: pdf
Submission Number: 6919
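
The abstract describes a controller that emits explicit control tokens for expert selection and sequencing, plus a text-centric memory that avoids redundant expert invocations. The following is only a minimal illustrative sketch of that idea, not the authors' implementation: the `<call:NAME>` token syntax, the `TextMemory` class, the `run_turn` function, and the stub experts are all hypothetical names introduced here, and the controller output is passed in as a plain string rather than produced by a real LLM.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical control-token syntax (an assumption, not taken from the paper).
TOKEN_RE = re.compile(r"<call:([a-z_]+)>")

Expert = Callable[[str], str]


def parse_route(controller_output: str, registry: Dict[str, Expert]) -> List[str]:
    """Extract the expert sequence and reject any name outside the registry."""
    names = TOKEN_RE.findall(controller_output)
    unknown = [n for n in names if n not in registry]
    if unknown:
        # Protocol-constrained routing: only registered experts may be called.
        raise ValueError(f"unregistered experts: {unknown}")
    return names


class TextMemory:
    """Stores each expert's text output keyed by (expert, input) for reuse."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], str] = {}
        self.hits = 0

    def get_or_compute(self, expert: str, payload: str, fn: Expert) -> str:
        key = (expert, payload)
        if key in self._records:
            self.hits += 1  # reuse a stored record instead of calling the expert
        else:
            self._records[key] = fn(payload)
        return self._records[key]


def run_turn(controller_output: str, payload: str,
             registry: Dict[str, Expert], memory: TextMemory) -> str:
    """Run the routed experts in order, feeding each output to the next."""
    result = payload
    for name in parse_route(controller_output, registry):
        result = memory.get_or_compute(name, result, registry[name])
    return result


# Stub experts standing in for real speech or vision models (hypothetical).
calls: List[str] = []


def fake_asr(audio_ref: str) -> str:
    calls.append("asr")
    return f"transcript({audio_ref})"


def fake_tts(text: str) -> str:
    calls.append("tts")
    return f"audio({text})"


REGISTRY: Dict[str, Expert] = {"asr": fake_asr, "tts": fake_tts}
```

Calling `run_turn("<call:asr><call:tts>", "clip01", REGISTRY, TextMemory())` chains the two stubs; repeating the same call with the same memory object serves both steps from stored records, which is one simple reading of how a text memory could reduce repeated expert calls across turns.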