Abstract: Contextual biasing in ASR systems is critical for recognizing rare, domain-specific terms but becomes impractical with large keyword dictionaries due to prompt size and latency constraints. We present RECAST, a lightweight retrieval-augmented approach that repurposes decoder states of a pretrained ASR model to retrieve relevant keywords without requiring audio exemplars. RECAST introduces a contrastively trained retriever that aligns decoder-state embeddings with textual keyword representations, enabling fast token-level retrieval over large dictionaries. Retrieved keywords are ranked and formatted into a prompt to guide a downstream speech language model. Trained solely on LibriSpeech and evaluated on out-of-domain benchmarks covering up to 4,000 keywords across diverse domains, RECAST consistently outperforms full-list prompt biasing and strong phonetic/text baselines. It achieves up to 54.3% relative reduction in entity WER and 41.3% overall WER improvement over the baseline, along with up to 2.5× higher recall in challenging settings. Furthermore, RECAST remains effective for diverse languages such as Hindi, demonstrating its scalability, language-agnostic design, and practicality for real-world contextual ASR.
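The retrieve-then-prompt flow the abstract describes can be sketched in a few lines. This is a minimal illustration only, with toy stand-in embeddings: in RECAST the query vector would come from the pretrained ASR model's decoder states and the keyword vectors from the contrastively trained text encoder; the function names below are hypothetical, not the paper's API.

```python
import math

# Toy stand-in for the learned encoders described in the abstract.
# In RECAST, decoder-state embeddings and contrastively trained keyword
# embeddings would live in a shared space; here we fake that with a
# deterministic character-based vector, purely for illustration.
def embed(text, dim=8):
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Both vectors are unit-normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def retrieve_keywords(state_vec, dictionary, top_k=3):
    # Rank every dictionary keyword by similarity to the query embedding
    # and keep the top_k matches (token-level retrieval over the dictionary).
    ranked = sorted(dictionary, key=lambda kw: cosine(state_vec, embed(kw)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(keywords):
    # Format the retrieved keywords into a biasing prompt for the
    # downstream speech language model.
    return "Relevant terms: " + ", ".join(keywords)

dictionary = ["amoxicillin", "LibriSpeech", "Thiruvananthapuram", "tokenizer"]
query = embed("amoxicillin")  # stand-in for a decoder-state embedding
print(build_prompt(retrieve_keywords(query, dictionary, top_k=2)))
```

The point of the sketch is the shape of the pipeline: because only the top-k retrieved keywords enter the prompt, the prompt stays small and latency bounded even when the full dictionary holds thousands of entries.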
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, spoken language grounding
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English, Hindi
Submission Number: 3469