How Much Context Is Enough? Evaluating the Role of Audio and Textual Context in ASR Systems

ACL ARR 2025 May Submission 4795 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Automatic Speech Recognition (ASR) systems often process audio in short segments, limiting their ability to leverage broader context. This work systematically explores how increasing both audio and textual context length affects ASR performance. We evaluate multiple architectures, including Fast Conformer with CTC and RNN-T decoders as well as multimodal models such as Whisper and Qwen2-Audio, across context windows ranging from a few seconds to fifteen minutes. Empirical results on both short- and long-form English speech, as well as a Korean lecture dataset, reveal that longer context windows can significantly reduce transcription errors and improve coherence. However, excessive context sometimes saturates or even harms performance due to computational overhead and error propagation. Our findings highlight the importance of carefully balancing context length to maximize ASR performance while mitigating potential drawbacks.
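Below is a minimal sketch of the kind of context-window sweep the abstract describes, not the authors' actual pipeline: it transcribes one long recording in chunks of increasing length and scores each setting with word error rate. It assumes the openai-whisper and jiwer packages; the file names "lecture.wav" and "lecture_reference.txt" and the choice of the "base" model are hypothetical placeholders.

```python
import whisper
import jiwer

SAMPLE_RATE = 16000  # whisper.load_audio resamples everything to 16 kHz

model = whisper.load_model("base")
audio = whisper.load_audio("lecture.wav")          # hypothetical long-form recording
reference = open("lecture_reference.txt").read()   # hypothetical ground-truth transcript

# Sweep audio-context lengths from 30 s up to 15 min: split the recording into
# fixed-size windows, transcribe each window, and score the concatenated output.
for window_sec in (30, 60, 300, 900):
    window = window_sec * SAMPLE_RATE
    hypothesis = " ".join(
        model.transcribe(audio[start:start + window])["text"].strip()
        for start in range(0, len(audio), window)
    )
    print(f"{window_sec:>4}s context: WER = {jiwer.wer(reference, hypothesis):.3f}")
```

Larger windows also give Whisper's decoder more textual context, since it conditions each internal 30-second segment on the text it has already produced within the same call.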
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, Korean
Submission Number: 4795