TL;DR: A duplex speech-to-speech dialogue system based on a frozen LLM.
Abstract: GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multimodal LLMs to achieve user-agent speech-to-speech conversations. In this paper, we propose a novel speech-text multimodal LLM architecture called Freeze-Omni; our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while the LLM's parameters remain frozen throughout training. We effectively ensure that the intelligence of Freeze-Omni in the speech modality matches that of its backbone LLM in the text modality, while achieving low latency in the end-to-end spoken response. In addition, we design a method to achieve duplex dialogue ability through multitask training, giving Freeze-Omni a more natural dialogue style between users and agents. In summary, Freeze-Omni holds great potential for speech-to-speech dialogue based on a multimodal LLM with a frozen backbone, avoiding the catastrophic forgetting caused by limited data and training resources.
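The core training recipe, freezing the LLM backbone while optimizing only the speech-side modules, can be illustrated with a minimal PyTorch sketch. This is not the paper's actual implementation: the `FrozenLLMStub`, the linear `speech_encoder`/`speech_decoder` adapters, and all dimensions are illustrative placeholders for the pretrained backbone and speech components described in the abstract.

```python
# Minimal sketch (assumptions noted above): trainable speech adapters
# around a frozen text LLM, so gradients never touch the backbone.
import torch
import torch.nn as nn

EMBED_DIM, SPEECH_DIM, N_SPEECH_TOKENS = 512, 80, 1024  # illustrative sizes

class FrozenLLMStub(nn.Module):
    """Stand-in for a pretrained textual LLM that consumes input embeddings."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, inputs_embeds):
        return self.body(inputs_embeds)  # hidden states

class FreezeOmniSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = FrozenLLMStub()
        # Trainable adapters: speech features -> LLM embedding space -> speech tokens.
        self.speech_encoder = nn.Linear(SPEECH_DIM, EMBED_DIM)
        self.speech_decoder = nn.Linear(EMBED_DIM, N_SPEECH_TOKENS)
        for p in self.llm.parameters():  # freeze the backbone entirely
            p.requires_grad = False

    def forward(self, audio_features):
        hidden = self.llm(self.speech_encoder(audio_features))
        return self.speech_decoder(hidden)  # speech-token logits

model = FreezeOmniSketch()
# Only the speech modules are optimized; the frozen LLM receives no updates.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
logits = model(torch.randn(2, 50, SPEECH_DIM))  # (batch, frames, features)
print(logits.shape)  # torch.Size([2, 50, 1024])
```

Restricting the optimizer to parameters with `requires_grad=True` is what preserves the backbone's text intelligence: the speech modalities are attached without ever updating, and hence without ever degrading, the LLM itself.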
Lay Summary: Building AI systems that can have natural, real-time voice conversations like humans requires connecting speech inputs to powerful language models. However, retraining these models for speech often demands massive resources and risks degrading their existing text-based intelligence.
We designed Freeze-Omni, a system that adds speech interaction to large language models (LLMs) without altering their core knowledge. Imagine plugging a microphone and speaker into a frozen AI brain—our method trains only the speech components while keeping the LLM’s original skills intact. We also taught the system to handle smooth back-and-forth dialogue, mimicking natural human conversation.
Freeze-Omni enables voice assistants to respond as intelligently in speech as they do in text, with minimal delay. This approach reduces training costs, avoids "forgetting" previous knowledge, and paves the way for more accessible, human-like AI communication tools—even for teams with limited data or computing power.
Link To Code: https://github.com/VITA-MLLM/Freeze-Omni
Primary Area: Applications->Language, Speech and Dialog
Keywords: Speech to Speech, Duplex Dialogue Model, Multimodal Large Language Models
Submission Number: 8855