Keywords: model-brain alignment, diffusion large language model, speech production, ECoG
TL;DR: Diffusion LLM embeddings explained ECoG signal variance in specific cortical regions more effectively than autoregressive LLMs.
Abstract: Human language production requires transforming abstract communicative intent into fluent speech, yet the algorithmic nature of this transformation remains poorly understood. Most studies aligning large language models (LLMs) with brain activity have focused on autoregressive LLMs (aLLMs), which generate text left-to-right by committing to one token at a time. While effective at predicting neural and behavioral signatures of comprehension, this paradigm assumes strictly incremental generation. In contrast, diffusion LLMs (dLLMs) construct sentences by iteratively denoising global representations. Despite their distinct generative dynamics, dLLMs now rival aLLMs on standard NLP benchmarks, prompting the question of whether the brain likewise engages in global, iterative refinement—especially during pre-articulatory planning, when sentence structure remains flexible. To test this hypothesis, we correlated intermediate denoising steps of a dLLM with electrocorticography (ECoG) activity during naturalistic speech production. dLLM representations explained significant neural variance from pre- to post-production, with especially strong encoding in middle/inferior temporal and motor-related regions. These results support iterative refinement as a plausible neural mechanism of human speech planning.
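The encoding analysis described above—predicting electrode activity from model embeddings and scoring held-out fits—can be sketched with a standard cross-validated ridge regression. This is a minimal illustration, not the paper's pipeline: the array names, shapes, and the synthetic data are all hypothetical stand-ins for per-word dLLM denoising-step embeddings and word-aligned ECoG responses.

```python
# Hedged sketch of a linear encoding model: map hypothetical dLLM
# embeddings to hypothetical ECoG responses via cross-validated ridge.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_words, emb_dim, n_electrodes = 200, 64, 8

# Synthetic stand-ins: per-word embeddings from one denoising step,
# and per-word electrode responses with a linear relation plus noise.
embeddings = rng.standard_normal((n_words, emb_dim))
ecog = (embeddings @ rng.standard_normal((emb_dim, n_electrodes))
        + 0.5 * rng.standard_normal((n_words, n_electrodes)))

# Fit ridge per fold; collect out-of-fold predictions for every word.
preds = np.zeros_like(ecog)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train, test in kf.split(embeddings):
    model = Ridge(alpha=10.0).fit(embeddings[train], ecog[train])
    preds[test] = model.predict(embeddings[test])

# Score each electrode by Pearson correlation on held-out predictions;
# this per-electrode r is the "encoding" strength reported per region.
r = [np.corrcoef(ecog[:, e], preds[:, e])[0, 1]
     for e in range(n_electrodes)]
print(np.round(r, 2))
```

In practice such scores would be computed separately at each denoising step and each time lag relative to word onset, then compared against an aLLM baseline; this sketch shows only the core regression-and-correlation step.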
Primary Area: applications to neuroscience & cognitive science
Submission Number: 22585