Condition-Dependent Representational Alignment between Whisper and the Human Speech Network

Published: 23 Sept 2025 · Last Modified: 17 Nov 2025 · UniReps 2025 · CC BY 4.0
Track: Extended Abstract Track
Keywords: model brain alignment, layerwise mapping, Whisper Tiny, encoding models, fMRI, speech in noise, brain score, predictive processing, precision weighting, auditory cortex, inferior frontal gyrus, middle frontal gyrus
TL;DR: Layer-wise alignment between model layers and brain regions shows that clean speech aligns with frontal prediction-related representations and noisy speech with auditory cortex; fMRI and behavioral data from 25 listeners support a precision-weighted account of speech perception.
Abstract: Representations in modern speech models often align with human brain activity, but how acoustic degradation alters this alignment remains unclear. Here, we quantify condition-sensitive model–brain correspondence between an automatic speech recognition (ASR) model and the human cortex. Twenty-five participants listened to clean and noisy (−3 dB SNR) sentences while undergoing fMRI. Layer-wise embeddings from Whisper Tiny (an encoder–decoder Transformer) were mapped to voxel time series using ridge-regularized linear encoding to obtain normalized neural predictivity. Under clean speech, alignment peaked for decoder representations in the left middle frontal gyrus (MFG), with additional encoder peaks in the right inferior frontal gyrus (IFG). Under noisy speech, peaks shifted toward encoder layers in the right Heschl’s gyrus and the right IFG pars orbitalis (IFGorb). Moreover, we observed significantly higher neural predictivity for clean than for noisy speech in the right IFG at middle and late encoder layers and in the left MFG at a middle decoder layer. These results demonstrate condition-dependent cortical alignment profiles across model layers and suggest a dynamic reweighting between feedforward acoustic encoding and top-down predictive decoding.
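The core analysis pipeline described in the abstract (mapping layer-wise model embeddings to voxel time series with ridge-regularized linear encoding, then scoring normalized neural predictivity) can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' code; the array shapes, the regularization strength, and the fixed noise-ceiling value are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T fMRI time points (TRs), D embedding dims, V voxels.
T, D, V = 200, 64, 10
X = rng.standard_normal((T, D))                      # layer embeddings, resampled to TR
W_true = rng.standard_normal((D, V))                 # synthetic "true" encoding weights
Y = X @ W_true + 0.5 * rng.standard_normal((T, V))   # simulated voxel time series

# Split into training and held-out folds.
X_tr, X_te = X[:150], X[150:]
Y_tr, Y_te = Y[:150], Y[150:]

# Ridge-regularized linear encoding: W = (X'X + alpha*I)^-1 X'Y
alpha = 10.0  # placeholder; typically chosen by cross-validation per voxel
W = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(D), X_tr.T @ Y_tr)
Y_hat = X_te @ W

# Per-voxel Pearson correlation between predicted and observed held-out series.
yc = Y_te - Y_te.mean(axis=0)
pc = Y_hat - Y_hat.mean(axis=0)
r = (yc * pc).sum(axis=0) / (
    np.linalg.norm(yc, axis=0) * np.linalg.norm(pc, axis=0)
)

# Normalized neural predictivity: raw r divided by a noise ceiling.
noise_ceiling = 0.8  # placeholder; in practice estimated per voxel from repeats
predictivity = r / noise_ceiling
print(predictivity.round(2))
```

In the study, this fit would be repeated for every Whisper Tiny encoder and decoder layer and every voxel, yielding the layer-by-region alignment profiles that are then compared between the clean and −3 dB SNR conditions.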
Submission Number: 94