CLSR: End-to-end Contrastive Language-Speech Retriever For Better Speech Retrieval Augmented Generation
Abstract: Significant progress has been made in spoken question answering in recent years. However, many existing methods, including Large Audio Language Models (LALMs), have been developed only for short audio files and struggle to process long audio. Speech Retrieval Augmented Generation (SRAG) follows the success of RAG in handling long-form speech, where an effective retriever serves as a critical first step. However, cross-modal retrievers in SRAG remain understudied, with current approaches relying either on pipeline methods (ASR followed by text RAG) or on generic audio-text alignment models. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for downstream RAG processing. Unlike conventional speech-text contrastive models that directly align cross-modal representations, CLSR introduces an intermediate step, first mapping acoustic features into text-like representations before alignment, which bridges the modality gap more effectively. Experimental results on four cross-modal retrieval datasets demonstrate that CLSR outperforms both end-to-end speech-text retrievers and pipeline approaches combining ASR with text retrieval. Our pre-trained CLSR model establishes a new state of the art in cross-modal language-speech alignment, significantly surpassing previous general language-audio models such as CLAP, thereby providing a robust foundation for practical SRAG applications.
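The abstract describes CLIP/CLAP-style contrastive training, with the twist that acoustic features pass through an intermediate mapping into a text-like space before alignment. A minimal sketch of that idea is below; the linear `adapter` and the symmetric InfoNCE loss are illustrative assumptions, since the abstract does not specify CLSR's actual architecture or loss.

```python
import numpy as np

def adapter(acoustic, W):
    # Hypothetical linear adapter mapping acoustic features into a
    # text-like representation space before alignment (an assumption;
    # the paper's actual adapter is not described in the abstract).
    return acoustic @ W

def info_nce(text_emb, speech_emb, temperature=0.07):
    # Symmetric contrastive (InfoNCE) loss over a batch of paired
    # text/speech embeddings, as used in CLIP/CLAP-style training.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = t @ s.T / temperature          # pairwise similarities
    labels = np.arange(len(logits))         # matched pairs on the diagonal

    def xent(l):
        # Cross-entropy of the diagonal (positive) entries.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average both retrieval directions: text->speech and speech->text.
    return (xent(logits) + xent(logits.T)) / 2

# Toy usage with random features of hypothetical dimensions.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(4, 16))   # 4 speech segments, 16-dim acoustic
W = rng.normal(size=(16, 8))          # adapter into the 8-dim text space
text = rng.normal(size=(4, 8))        # 4 paired question/text embeddings
loss = info_nce(text, adapter(acoustic, W))
```

Minimizing such a loss pulls each question toward its matching speech segment and pushes apart mismatched pairs, which is what enables retrieving question-relevant segments from long recordings.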
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: QA via spoken queries; multimodal QA; passage retrieval; contrastive learning
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English, Chinese
Submission Number: 2083