Ground Truth-Free WER Prediction for ASR via Audio Quality and Model Confidence Features

Anton Polevoi, Alexander Kragin, Natalia Loukachevitch

Published: 01 Jan 2026 · Last Modified: 15 Apr 2026 · License: CC BY-SA 4.0
Abstract: We propose a data-driven approach for predicting Word Error Rate (WER) without requiring ground truth transcriptions. Our method creates diverse audio datasets by applying various noise types, acoustic degradations, and room impulse responses to clean speech samples across many fine-grained quality and intelligibility levels. Unlike previous work, we extract and analyze a comprehensive set of speech quality features, including signal-to-noise ratio (SNR) estimates, modern neural audio quality metrics (such as NISQA), and Automatic Speech Recognition (ASR) model confidence scores, to train WER prediction models. We conduct experiments across multiple languages with state-of-the-art ASR architectures (Whisper and FastConformer) to demonstrate our method’s effectiveness in predicting WER under diverse acoustic conditions. We also show that our approach generalizes in a unified setting spanning multiple languages and ASR models. We provide a feature importance analysis to identify the key metrics needed to predict WER. This work enables practical applications such as quality-based filtering of audio inputs, allowing ASR systems to assess expected performance and estimate transcription reliability without ground truth transcripts.
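The core idea of the abstract can be illustrated with a minimal sketch (not the authors' implementation): a regression model maps per-utterance quality features, here hypothetical stand-ins for estimated SNR, a NISQA-style neural quality score, and mean ASR confidence, to a WER value. All feature names, the synthetic data, and the linear model below are illustrative assumptions; the paper's actual feature set and predictor may differ.

```python
# Sketch: predict WER from hypothetical audio-quality features using
# ordinary least squares on synthetic data. Features and their effect
# sizes are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic per-utterance features standing in for extracted metrics.
snr = rng.uniform(0, 30, n)      # estimated SNR in dB
nisqa = rng.uniform(1, 5, n)     # NISQA-style quality score (1-5 scale)
conf = rng.uniform(0.3, 1.0, n)  # mean ASR token confidence

# Synthetic "ground-truth" WER: worse quality/confidence -> higher WER.
wer = 0.6 - 0.01 * snr - 0.05 * nisqa - 0.25 * conf + rng.normal(0, 0.02, n)
wer = np.clip(wer, 0.0, 1.0)

X = np.column_stack([np.ones(n), snr, nisqa, conf])  # design matrix
coef, *_ = np.linalg.lstsq(X, wer, rcond=None)       # fit WER predictor

pred = X @ coef
mae = np.mean(np.abs(pred - wer))
print(f"MAE of predicted WER on synthetic data: {mae:.3f}")
```

In practice one would replace the synthetic features with measured quality metrics and fit against WER computed on a labeled development set; at inference time the predictor then scores unlabeled audio, enabling the quality-based filtering the abstract describes.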