How Much Context Is Enough? Evaluating the Role of Audio and Textual Context in ASR Systems

ACL ARR 2025 May Submission 4795 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Automatic Speech Recognition (ASR) systems often process audio in short segments, limiting their ability to leverage broader context. This work systematically explores how increasing both audio and textual context length affects ASR performance. We evaluate multiple architectures, including Fast Conformer with CTC and RNN-T decoders as well as multimodal models such as Whisper and Qwen2-Audio, across context windows ranging from a few seconds to fifteen minutes. Empirical results on both short- and long-form English speech, as well as a Korean lecture dataset, reveal that longer context windows can significantly reduce transcription errors and improve coherence. However, excessive context sometimes saturates or even harms performance due to computational overhead and error propagation. Our findings highlight the importance of carefully balancing context length to maximize ASR performance while mitigating potential drawbacks.
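Below is a minimal sketch of the kind of context-window sweep the abstract describes, not the authors' actual pipeline: it transcribes one long recording in chunks of increasing length and scores each setting with word error rate. It assumes the openai-whisper and jiwer packages; the file names "lecture.wav" and "lecture_reference.txt" and the choice of the "base" model are hypothetical placeholders.

```python
import whisper
import jiwer

SAMPLE_RATE = 16000  # whisper.load_audio resamples everything to 16 kHz

model = whisper.load_model("base")
audio = whisper.load_audio("lecture.wav")          # hypothetical long-form recording
reference = open("lecture_reference.txt").read()   # hypothetical ground-truth transcript

# Sweep audio-context lengths from 30 s up to 15 min: split the recording into
# fixed-size windows, transcribe each window, and score the concatenated output.
for window_sec in (30, 60, 300, 900):
    window = window_sec * SAMPLE_RATE
    hypothesis = " ".join(
        model.transcribe(audio[start:start + window])["text"].strip()
        for start in range(0, len(audio), window)
    )
    print(f"{window_sec:>4}s context: WER = {jiwer.wer(reference, hypothesis):.3f}")
```

Larger windows also give Whisper's decoder more textual context, since it conditions each internal 30-second segment on the text it has already produced within the same call.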
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, Korean
Submission Number: 4795