Representational Alignment between Deep Neural Networks and Human Brain in Speech Processing under Audiovisual Noise

ICLR 2026 Conference Submission 15162 Authors

Published: 19 Sept 2025 (last modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Automatic Speech Recognition, EEG, Information Representation
Abstract: Speech recognition in the human brain is an incremental process that begins with acoustic processing and advances to linguistic processing. While recent studies have revealed that the hierarchy of deep neural networks (DNNs) correlates with the ascending auditory pathway, the exact nature of this DNN-brain alignment remains underexplored. In this study, we investigate how DNN representations align with the brain's acoustic-to-linguistic processing. Specifically, we employed neural encoding models to simulate neural responses to acoustic (i.e., speech and noise envelope) and linguistic features (i.e., word onset and surprisal). By applying representational similarity analysis (RSA), we quantified the similarity between these neural responses and the DNN embeddings generated by a pre-trained automatic speech recognition (ASR) model, both before and after fine-tuning on audiovisual noisy data. Our results demonstrate significant DNN-brain alignment: embeddings from shallow layers exhibit higher similarity to neural responses associated with acoustic features, while those from deeper layers align more closely with neural responses related to linguistic features. Importantly, the fine-tuning process enhances this alignment by improving noise processing in shallower layers and refining linguistic representations in deeper layers. These results suggest that fine-tuned DNN models can naturally develop human-like processing patterns in noisy environments, highlighting a functional alignment between the human brain and DNNs in speech representation.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 15162