ProfASR-Bench: A Professional-talk ASR Dataset for High-Stakes Applications Exposing the Context-Utilization Gap

ICLR 2026 Conference Submission 13917 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: automatic speech recognition, contextual ASR, professional-talk speech, high-stakes applications, domain-specific speech (finance/medical/legal/tech), entity-aware evaluation, fairness and bias (accent/gender), prompt-conditioned evaluation
TL;DR: Professional-talk ASR dataset + contextual eval protocol for high-stakes domains; tests prompts, entities, accents, genders.
Abstract: Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present \textsc{ProfASR-Bench}, a \emph{professional-talk} evaluation suite for high-stakes applications across finance, medicine, law, and technology. Each example pairs a natural-language \emph{prompt} (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of \emph{context-conditioned} recognition. The corpus supports conventional metrics alongside \emph{entity-aware} scores and slice-wise reporting by accent and gender. Using representative model families \emph{Whisper} (encoder–decoder ASR) and \emph{Qwen-Omni} (audio LM) under matched \emph{no-context}, \emph{profile}, \emph{domain+profile}, \emph{oracle}, and \emph{adversarial} conditions, we uncover a consistent pattern: lightweight textual context produces little to no change in average WER, even when the gold transcript is provided as an oracle prompt, and adversarial prompts do not reliably degrade WER. We term this the \textbf{\emph{context-utilization gap (CUG)}}: current systems are nominally promptable yet underuse readily available side information. Entity-centric analyses reveal only modest, model-dependent gains on information-bearing tokens, underscoring the need for stronger fusion mechanisms and calibrated trust in prompts. \textsc{ProfASR-Bench} contributes (i) a standardized \emph{context ladder} with paired, within-utterance estimation; (ii) entity-aware and slice-aware reporting with confidence intervals; and (iii) a reproducible testbed to compare fusion strategies across model families. We release data and code to foster comparable, context-aware evaluation in high-stakes domains.
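To make the prompt-conditioned protocol concrete, the sketch below illustrates paired, within-utterance WER deltas across the context-ladder conditions and a simple entity-recall proxy. This is a minimal illustration under stated assumptions, not the released evaluation code: the data layout (`ref`, `hyps`, `entities`), the function names, and the token-overlap entity score are illustrative choices, not the benchmark's exact metric definitions.

```python
from collections import Counter

# Context-ladder conditions named in the abstract; prompt wording and entity
# annotation details are assumptions made for this illustration.
CONDITIONS = ["no_context", "profile", "domain_profile", "oracle", "adversarial"]

def edit_distance(ref_tokens, hyp_tokens):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    m, n = len(ref_tokens), len(hyp_tokens)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (ref_tokens[i - 1] != hyp_tokens[j - 1])
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n]

def wer(ref, hyp):
    """Word error rate of one hypothesis against one reference."""
    ref_toks, hyp_toks = ref.split(), hyp.split()
    return edit_distance(ref_toks, hyp_toks) / max(len(ref_toks), 1)

def entity_recall(ref_entities, hyp):
    """Fraction of gold entity tokens recovered in the hypothesis
    (a simple entity-aware proxy; the benchmark's metric may differ)."""
    hyp_counts = Counter(hyp.split())
    hits = sum(min(hyp_counts[tok], c) for tok, c in Counter(ref_entities).items())
    return hits / max(len(ref_entities), 1)

def paired_deltas(examples):
    """Within-utterance WER deltas of each prompted condition vs. no-context,
    so every condition is compared on exactly the same utterances."""
    deltas = {c: [] for c in CONDITIONS if c != "no_context"}
    for ex in examples:  # ex: {"ref": str, "hyps": {condition: str}, "entities": [str]}
        base = wer(ex["ref"], ex["hyps"]["no_context"])
        for cond in deltas:
            deltas[cond].append(wer(ex["ref"], ex["hyps"][cond]) - base)
    return deltas
```

Under this paired design, a context-utilization gap would show up as per-condition delta distributions centered near zero even for the oracle prompt; bootstrap confidence intervals over the per-utterance deltas (and over accent/gender slices) would follow the same structure.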
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13917