Lychee-FD: Hierarchical Acoustic-Semantic Modeling for Full-Duplex Spoken Language Models

ACL ARR 2026 January Submission8056 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: Large Language Model, Multimodal Large Language Model, Spoken Language Model
Abstract: Spoken Language Models (SLMs) have revolutionized voice interaction, yet they remain constrained by rigid half-duplex mechanisms that fail to replicate the fluidity of human conversation. While recent Full-Duplex SLMs attempt to bridge this gap by enabling real-time capabilities such as interruption and backchanneling, these methods suffer from severe modality interference. Specifically, adapting models for native full-duplex interaction often induces significant knowledge degradation, impeding the realization of seamless human-machine interaction. To address this, we conduct an optimization dynamics analysis and identify the root cause as an inherent gradient conflict between acoustic rendering and semantic modeling within a shared parameter space. Guided by this insight, we introduce **Lychee-FD**, a native end-to-end full-duplex framework designed to mitigate modality interference. We propose a hierarchical parameter separation strategy that decouples the conflicting modalities in deep layers. Moreover, we incorporate a semantic alignment channel that enables the model to preserve coherent internal monologues, ensuring the robustness of semantic modeling during training. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple full-duplex benchmarks, delivering an average **7.4\%** improvement on Spoken QA tasks and a **28.5\%** improvement on FullDuplexBench 1.5. Our work thus uncovers the fundamental causes of modality interference in Full-Duplex SLMs and provides an effective approach to reconciling interaction efficiency with robust knowledge retention.
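The gradient conflict the abstract diagnoses can be illustrated with a minimal sketch (not the paper's method or code): when two objectives share parameters, a negative cosine similarity between their gradients means each task's update partially undoes the other's. The toy quadratic losses and target vectors below are assumptions for illustration only.

```python
import numpy as np

def grad_cosine(g_a, g_b):
    """Cosine similarity between two task gradients on shared parameters.
    A negative value indicates conflicting update directions."""
    return float(np.dot(g_a, g_b) / (np.linalg.norm(g_a) * np.linalg.norm(g_b)))

# Toy shared parameters and two stand-in objectives (hypothetical proxies
# for an acoustic-rendering loss and a semantic-modeling loss):
#   L_acoustic(w) = ||w - t_a||^2,  L_semantic(w) = ||w - t_s||^2
w = np.array([0.0, 0.0])
t_a = np.array([1.0, 0.0])    # assumed acoustic target
t_s = np.array([-1.0, 0.5])   # assumed semantic target

g_acoustic = 2 * (w - t_a)    # gradient of L_acoustic at w
g_semantic = 2 * (w - t_s)    # gradient of L_semantic at w

cos = grad_cosine(g_acoustic, g_semantic)
print(cos)  # negative cosine => the two objectives pull w in opposing directions
```

In this toy setup the cosine is strongly negative, so a single shared update cannot serve both losses; separating parameters per modality (as the abstract's hierarchical strategy does in deep layers) is one way to remove such conflicts.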
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: spoken dialogue system, dialogue state tracking, applications
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 8056