Acoustic Degradation Reweights Cortical and ASR Processing: A Brain-Model Alignment Study

Published: 23 Sept 2025, Last Modified: 17 Feb 2026
Venue: CogInterp @ NeurIPS 2025 Poster
License: CC BY 4.0
Keywords: model–brain alignment, Whisper Tiny, fMRI, encoding models, brain score, predictive processing, precision weighting, speech-in-noise, auditory cortex, inferior frontal gyrus, middle frontal gyrus, layer-wise analysis
TL;DR: Layer-wise alignment between model layers and brain regions shows that clean speech aligns with frontal predictive regions and noisy speech with auditory cortex; fMRI and behavior in 25 listeners support a precision-weighted account of speech perception.
Abstract: We tested whether acoustic degradation changes how a modern ASR model represents speech, and whether those changes explain human brain responses and behavior. Twenty-five participants listened to clean and noisy ($-3$ dB SNR) Mandarin sentences during fMRI while we extracted layer-wise embeddings from Whisper-Tiny. We computed brain scores normalized to each ROI's noise ceiling. Behavioral measures, including intelligibility, perceived quality, and comprehension, declined under noise. Under clean speech, alignment emphasized frontal predictive processing: encoder layers 3 and 4 peaked in the right inferior frontal gyrus (IFG), and decoder layer 2 peaked in the left middle frontal gyrus (MFG). Under noisy speech, alignment shifted toward early acoustic and evaluative regions: encoder layer 1 peaked in the right Heschl’s gyrus, encoder layer 4 peaked in the right IFG pars orbitalis (IFGorb), and decoder peaks were weaker and more diffuse. Condition contrasts showed higher alignment for clean speech in the right IFG (encoder layers 3 and 4) and the left MFG (decoder layer 2). Together, these findings support a precision-weighted processing account, link model–brain alignment to behavior, and yield a compact layer-to-region map across listening conditions.
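The abstract's core quantity, a brain score normalized to each ROI's noise ceiling, can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' code: it assumes the common recipe of a ridge encoding model mapping layer embeddings to voxel responses, a per-voxel Pearson correlation between predicted and held-out responses, and division of the ROI-mean correlation by a hypothetical noise-ceiling estimate. All dimensions, the `alpha` value, and the `noise_ceiling` figure are placeholders.

```python
# Sketch of a noise-ceiling-normalized brain score (assumed pipeline,
# not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit_predict(X_train, Y_train, X_test, alpha=1.0):
    # Closed-form ridge regression: W = (X'X + alpha*I)^{-1} X'Y
    d = X_train.shape[1]
    W = np.linalg.solve(X_train.T @ X_train + alpha * np.eye(d),
                        X_train.T @ Y_train)
    return X_test @ W

def brain_score(X_train, Y_train, X_test, Y_test, noise_ceiling):
    Y_hat = ridge_fit_predict(X_train, Y_train, X_test)
    # Pearson r between predicted and observed response, per voxel
    r = np.array([np.corrcoef(Y_hat[:, v], Y_test[:, v])[0, 1]
                  for v in range(Y_test.shape[1])])
    # Normalize the ROI-mean correlation by the ROI's noise ceiling
    # (here a fixed placeholder; in practice a split-half estimate)
    return r.mean() / noise_ceiling

# Synthetic stand-ins: 250 stimuli, 384-dim embeddings (Whisper-Tiny's
# hidden size), 30 voxels whose responses are a noisy linear readout.
X = rng.standard_normal((250, 384))
Y = X @ rng.standard_normal((384, 30)) + 5.0 * rng.standard_normal((250, 30))
score = brain_score(X[:200], Y[:200], X[200:], Y[200:], noise_ceiling=0.8)
print(f"normalized brain score: {score:.3f}")
```

In practice this is computed per layer and per ROI under each listening condition, producing the layer-to-region maps the abstract summarizes; the normalization lets scores be compared across ROIs with different intrinsic reliability.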
Submission Number: 28