FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

ICLR 2026 Conference Submission17386 Authors

19 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0
Keywords: Audio Language Model, Full-duplexity, Spoken Dialog
TL;DR: We propose natural monologues and a dual training paradigm to improve native full-duplex spoken dialog models.
Abstract: Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among the different solutions to full-duplexity, a native solution merges multiple channels in each time step, achieving the lowest latency. However, prevailing designs break textual monologue sentences apart for word-level alignment with audio streams, which degrades language modeling abilities. To address this issue, we introduce “natural monologues”, which are composed of continuous sentences and “waiting” intervals, mimicking human-like cognitive behavior in dialog. We find a proper training paradigm to be critical for semantically aligning natural monologues with audio. To this end, we develop a “dual” training paradigm that alternates the position of the monologues, either leading or trailing the audio, across different training stages. The combination of natural monologues and dual training is applied in developing FLM-Audio, our 7B spoken dialog chatbot with native full-duplexity. As confirmed by experimental results, FLM-Audio achieves superior response quality and chat experience while requiring significantly less training data.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17386
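
The contrast drawn in the abstract between word-level alignment and natural monologues can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `<wait>` placeholder token, the function names, the one-token-per-time-step layout, and the reading of "leading" and "trailing" as placing the sentence at the start or the end of the audio span.

```python
# Illustrative sketch only: token names and layout are hypothetical,
# not the FLM-Audio implementation.
WAIT = "<wait>"  # hypothetical filler token for "waiting" intervals


def word_aligned(words, word_frames, n_frames):
    """Word-level alignment: each word sits at the frame where it is spoken,
    so the sentence is broken up by filler tokens."""
    channel = [WAIT] * n_frames
    for word, frame in zip(words, word_frames):
        channel[frame] = word
    return channel


def natural_monologue(words, n_frames, start=0):
    """Natural monologue: the sentence is emitted as one continuous run
    beginning at `start`; all other time steps are waiting intervals."""
    if start < 0 or start + len(words) > n_frames:
        raise ValueError("sentence does not fit in the frame window")
    channel = [WAIT] * n_frames
    channel[start:start + len(words)] = words
    return channel


words = ["how", "are", "you"]
n_frames = 6

# Sentence fragmented to follow the audio timing.
fragmented = word_aligned(words, [0, 2, 5], n_frames)

# Assumed "leading" layout: text runs ahead of the audio.
leading = natural_monologue(words, n_frames, start=0)

# Assumed "trailing" layout: text follows the audio.
trailing = natural_monologue(words, n_frames, start=n_frames - len(words))
```

In this toy layout the text channel of the word-aligned variant interleaves words with fillers, while both natural-monologue variants keep the sentence contiguous; a dual training schedule, as described in the abstract, would use the leading and trailing layouts in different training stages.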