CLSR: End-to-end Contrastive Language-Speech Retriever For Better Speech Retrieval Augmented Generation
Abstract: Significant progress has been made in spoken question answering in recent years. However, many existing methods, including Large Audio Language Models (LALMs), have been developed only for short audio files and struggle to process long audio. Speech Retrieval Augmented Generation (SRAG) follows the success of RAG in handling long-form speech, where an effective retriever serves as a critical first step. However, cross-modal retrievers in SRAG remain understudied, with current approaches relying either on pipeline methods (ASR followed by text RAG) or on generic audio-text alignment models. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for downstream RAG processing. Unlike conventional speech-text contrastive models that directly align cross-modal representations, CLSR introduces an intermediate step, first mapping acoustic features into text-like representations before alignment, which bridges the modality gap more effectively. Experimental results on four cross-modal retrieval datasets demonstrate that CLSR outperforms both end-to-end speech-text retrievers and pipeline approaches combining ASR with text retrieval. Our pre-trained CLSR model establishes a new state of the art in cross-modal language-speech alignment, significantly surpassing previous general language-audio models such as CLAP, thereby providing a robust foundation for practical SRAG applications.
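The abstract describes CLIP/CLAP-style contrastive training, with the twist that acoustic features pass through an intermediate mapping into a text-like space before alignment. A minimal sketch of that idea is below; the linear `adapter` and the symmetric InfoNCE loss are illustrative assumptions, since the abstract does not specify CLSR's actual architecture or loss.

```python
import numpy as np

def adapter(acoustic, W):
    # Hypothetical linear adapter mapping acoustic features into a
    # text-like representation space before alignment (an assumption;
    # the paper's actual adapter is not described in the abstract).
    return acoustic @ W

def info_nce(text_emb, speech_emb, temperature=0.07):
    # Symmetric contrastive (InfoNCE) loss over a batch of paired
    # text/speech embeddings, as used in CLIP/CLAP-style training.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = t @ s.T / temperature          # pairwise similarities
    labels = np.arange(len(logits))         # matched pairs on the diagonal

    def xent(l):
        # Cross-entropy of the diagonal (positive) entries.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average both retrieval directions: text->speech and speech->text.
    return (xent(logits) + xent(logits.T)) / 2

# Toy usage with random features of hypothetical dimensions.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(4, 16))   # 4 speech segments, 16-dim acoustic
W = rng.normal(size=(16, 8))          # adapter into the 8-dim text space
text = rng.normal(size=(4, 8))        # 4 paired question/text embeddings
loss = info_nce(text, adapter(acoustic, W))
```

Minimizing such a loss pulls each question toward its matching speech segment and pushes apart mismatched pairs, which is what enables retrieving question-relevant segments from long recordings.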
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: QA via spoken queries; multimodal QA; passage retrieval; contrastive learning
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English, Chinese
Submission Number: 2083