Keywords: Whisper, STT
TL;DR: We introduce a lightweight, interpretable metric called Local Confidence Drop to detect hallucinations in speech recognition models by identifying sudden breaks in contextual stability.
Abstract: Automatic speech recognition has advanced significantly with models like Whisper, yet confident hallucinations remain a critical challenge. In this work, we propose a lightweight and interpretable error detection framework that augments acoustic confidence with explicit contextual features. We introduce the Local Confidence Drop, a novel metric designed to capture sudden dips in stability between neighboring tokens. Evaluated on the FLEURS dataset, our random forest classifier achieves an average precision (AP) of 0.64, consistently outperforming the baseline (p < 0.001). We show that hallucinations manifest as local contextual discontinuities, which makes this approach a transparent alternative to opaque neural post-processors.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 7
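
The abstract describes the Local Confidence Drop only informally, as a measure of sudden dips in stability between neighboring tokens. A minimal sketch of one possible reading follows. It is not the authors' definition: the function name `local_confidence_drop`, the `window` parameter, the neighbor-mean formula, and the example confidence values are all assumptions made for illustration.

```python
from typing import List, Sequence


def local_confidence_drop(confidences: Sequence[float], window: int = 1) -> List[float]:
    """Hypothetical per-token score (an assumption, not the paper's formula):
    how far a token's confidence falls below the mean confidence of its
    neighbors within `window` positions. Tokens at or above their
    neighborhood mean score 0.0."""
    if window < 1:
        raise ValueError("window must be >= 1")
    n = len(confidences)
    drops: List[float] = []
    for i, conf in enumerate(confidences):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Neighbors are the tokens inside the window, excluding the token itself.
        neighbors = [confidences[j] for j in range(lo, hi) if j != i]
        if not neighbors:
            # A single-token sequence has no context to compare against.
            drops.append(0.0)
            continue
        neighbor_mean = sum(neighbors) / len(neighbors)
        drops.append(max(0.0, neighbor_mean - conf))
    return drops


# Illustrative (made-up) per-token confidences with one sharp dip at index 2.
token_confidences = [0.95, 0.93, 0.40, 0.94, 0.96]
drops = local_confidence_drop(token_confidences)
```

Under this reading, only the token whose confidence falls well below both neighbors receives a nonzero score. Such a per-token value could then serve as one input feature to a classifier alongside the raw acoustic confidence.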