Abstract: While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper’s multilingual decoder, examining its sub-token hypotheses during transcription across languages with varying resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher-resource languages benefit from a markedly higher likelihood of the correct token being top-ranked among candidate guesses, higher confidence, lower predictive entropy, and more diverse alternative candidates. Lower-resource languages fare worse on these metrics, but also exhibit distinct clustering patterns in sub-token usage, sometimes influenced by typology, in our PCA analysis. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.
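The per-step metrics named in the abstract (whether the correct token is top-ranked, model confidence, predictive entropy, and the set of alternative candidates) can be sketched as follows. This is a minimal illustration over a toy probability distribution, not the paper's actual instrumentation of Whisper's beam search; the helper name `step_metrics` and the sample sub-tokens are hypothetical.

```python
import math

def step_metrics(probs, correct_id, k=5):
    """Per-step decoding metrics from one sub-token probability
    distribution (hypothetical helper, illustrating the abstract's
    metrics). `probs` maps token ids to probabilities summing to 1."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return {
        # is the reference sub-token the model's top guess at this step?
        "correct_top1": ranked[0] == correct_id,
        # confidence: probability mass on the top-ranked candidate
        "confidence": probs[ranked[0]],
        # predictive entropy of the distribution, in bits
        "entropy": -sum(p * math.log2(p) for p in probs.values() if p > 0),
        # the alternative candidate guesses considered at this step
        "topk": ranked[:k],
    }

# toy distribution over four sub-token candidates at one decoding step
probs = {"_de": 0.7, "_den": 0.2, "_das": 0.08, "_die": 0.02}
metrics = step_metrics(probs, correct_id="_de")
```

Aggregating such per-step records over a transcription, and then over a corpus per language, yields the language-level comparisons the abstract reports.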
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: bias, fairness, interpretability, error analysis, multilinguality, automatic speech recognition, meta-analysis, low-resource languages, cross-lingual, typology
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: German, Spanish, French, Portuguese, Turkish, Italian, Swedish, Dutch, Catalan, Finnish, Indonesian, Hungarian, Romanian, Norwegian, Welsh, Lithuanian, Latvian, Azerbaijani, Estonian, Basque
Submission Number: 5091