Domain-Specific Adaptation for ASR through Text-Only Fine-Tuning

Published: 02 Dec 2025, Last Modified: 23 Dec 2025 · MMLoSo 2025 Poster · CC BY 4.0
Keywords: ASR, Whisper, Domain adaptation, Fine Tuning
TL;DR: A text-only domain adaptation method for Whisper that fine-tunes only the decoder using domain-relevant text
Abstract: Speech recognition models often struggle in specialized domains due to the lack of domain-specific paired audio-text data, making it difficult to adapt general-purpose systems to unique terminology and linguistic patterns. In this work, we propose a text-only domain adaptation method for Whisper, fine-tuning only the decoder using domain-relevant text. Our approach introduces trainable cross-attention bias embeddings, extended with a gated mixture-of-experts routing mechanism, enabling the model to encode domain-specific linguistic priors without any audio data. Unlike ASR adaptation methods that require paired audio-text datasets, our approach is lightweight and resource-efficient. We observe up to a 56% relative improvement in word error rate over the baseline. Our findings demonstrate that text-only adaptation is a practical and effective strategy for improving speech recognition in specialized domains with limited or no domain-specific audio.
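The abstract describes gated mixture-of-experts routing over trainable cross-attention bias embeddings. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of one plausible reading: K learnable bias vectors ("experts"), a softmax gate computed from the decoder hidden state, and the gate-weighted mixture returned as a bias to add to cross-attention. All names, shapes, and initialization choices are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedBiasMoE:
    """Hypothetical sketch of gated MoE bias embeddings (not the paper's code).

    Holds K trainable expert bias vectors and a linear gate. For each
    decoder hidden state, the gate produces a softmax mixture over
    experts; the mixed bias would be added to cross-attention.
    """
    def __init__(self, num_experts, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # (K, d): one trainable bias embedding per expert
        self.experts = rng.normal(0.0, 0.02, (num_experts, d_model))
        # (d, K): gate projection from hidden state to expert logits
        self.gate_w = rng.normal(0.0, 0.02, (d_model, num_experts))

    def __call__(self, h):
        # h: decoder hidden states, shape (T, d)
        gate = softmax(h @ self.gate_w, axis=-1)  # (T, K), rows sum to 1
        return gate @ self.experts                # (T, d) mixed bias

moe = GatedBiasMoE(num_experts=4, d_model=8)
h = np.ones((3, 8))          # toy decoder states for 3 tokens
bias = moe(h)                # per-token bias, shape (3, 8)
```

During text-only fine-tuning, only small modules like this (plus the decoder) would receive gradients, which is consistent with the abstract's claim that the method is lightweight and needs no audio.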
Submission Number: 19