Collaborative Spoken and Written Models for Conversational Language Modeling

Published: 01 Aug 2025, Last Modified: 26 Aug 2025
Venue: SpeechAI TTIC 2025 (Oral or Poster)
License: CC BY 4.0
Keywords: Speech-text language models, conversational language modeling
Presentation Preference: Open to it if recommended by organizers
Abstract: Research on joint speech-text generation with language models has attracted significant interest in recent years. These models aim to leverage the content-generation capabilities acquired through text-based pre-training to improve long-context coherence in speech generation, a known challenge for pure speech models such as generative spoken language models (GSLMs). Additionally, information from the speech modality can provide valuable signals that are absent from written language, potentially enhancing the model's ability to understand and generate language. However, adapting pre-trained text-based language models to handle new sequence formats, often consisting of interleaved text and speech tokens, requires substantial training data and computational resources. In this research, we explore decomposing the task into two parts, each handled by a model focused on a single modality: one for text and one for speech. While both models have access to information from both modalities, each remains focused on generation within its own domain. By avoiding the need to adapt models to new sequence formats, we aim to reduce the computational costs and resources required to develop joint speech-text generation frameworks, with the goal of ultimately enabling speech conversation systems to be built with academic-level resources.
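The abstract describes a decomposition in which a text model and a speech model each condition on the joint context but generate only tokens of their own modality. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' system: the names (ModalityLM, collaborative_generate), the shared text-plus-speech-unit vocabulary, and the simple turn-alternation policy are all assumptions made for the example.

```python
# Hypothetical sketch of collaborative, modality-specific generation.
# Assumptions (not from the paper): a shared token space where ids below
# `text_vocab_end` are text tokens and the rest are speech units, and a
# simple alternating schedule between the two models.

import torch
import torch.nn as nn


class ModalityLM(nn.Module):
    """Toy causal LM over the shared text + speech-unit vocabulary.
    In practice this would be a pre-trained text LM or a speech-unit LM.
    Positional encodings are omitted for brevity."""

    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def next_token_logits(self, history: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to the past.
        causal = nn.Transformer.generate_square_subsequent_mask(history.size(1))
        h = self.backbone(self.embed(history), mask=causal)  # (B, T, d_model)
        return self.head(h[:, -1])                           # logits for the next token


def collaborative_generate(text_lm, speech_lm, history, steps, text_vocab_end):
    """Both models read the full mixed history, but each sampled token is
    restricted to the emitting model's own sub-vocabulary."""
    for step in range(steps):
        model = text_lm if step % 2 == 0 else speech_lm
        logits = model.next_token_logits(history)
        mask = torch.full_like(logits, float("-inf"))
        if step % 2 == 0:
            mask[:, :text_vocab_end] = 0.0        # text model emits text tokens
        else:
            mask[:, text_vocab_end:] = 0.0        # speech model emits speech units
        next_tok = (logits + mask).argmax(-1, keepdim=True)
        history = torch.cat([history, next_tok], dim=1)
    return history


# Hypothetical usage: 1,000 text tokens plus 500 speech units in one vocabulary.
text_lm, speech_lm = ModalityLM(1500), ModalityLM(1500)
prompt = torch.randint(0, 1000, (1, 8))           # a short text-only prompt
out = collaborative_generate(text_lm, speech_lm, prompt, steps=6, text_vocab_end=1000)
```

The point of the sketch is only the division of labor: neither model has to be retrained on an unfamiliar interleaved sequence format, since each keeps generating in its native modality while reading the other's output as context.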
Submission Number: 2