Temporal-Aware Iterative Speech Model for Dementia Detection

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: IADL, Dementia detection, Optical Flow, Context-awareness, Cross-attention system
TL;DR: We created an AI that "watches" speech like a video, using motion-tracking principles to detect dementia by analyzing how acoustic patterns change over time, without needing to understand the words.
Abstract: Current acoustic markers for dementia detection often rely on static feature aggregation or error-prone linguistic transcription (ASR), thereby failing to capture the fine-grained, frame-to-frame temporal deterioration of articulatory motor control. To address this, we introduce TAI-Speech, an ASR-free framework that models speech deterioration as a continuous temporal trajectory analogous to physical motion. Our architecture introduces two key innovations: 1) Optical Flow-inspired Iterative Refinement: by treating spectrograms as sequential frames, this component uses a convolutional GRU to capture the fine-grained, frame-to-frame evolution of acoustic features; and 2) Cross-Modal Attention, which dynamically aligns spectral features with prosodic contours (pitch and pauses) to detect pathological mismatches. Experimental evaluation on the DementiaBank Corpus demonstrates that TAI-Speech achieves an AUC of 83.9% and a recall of 89.0%. Importantly, our model surpasses strong state-of-the-art baselines on ROC AUC, including fine-tuned Wav2Vec 2.0 (67.9%), Audio Spectrogram Transformers (74.8%), and CNNs (76.8%). These results confirm that explicitly modeling acoustic flow yields superior diagnostic sensitivity compared to latent linguistic representations or static classifiers, offering a privacy-preserving and computationally efficient solution for early cognitive screening.
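The two components named in the abstract can be illustrated with a minimal NumPy sketch: a convolutional GRU that iterates over spectrogram frames as if they were video frames, followed by scaled dot-product cross-attention in which spectral hidden states attend to prosodic features. All array sizes, parameter names (`wz`, `Wq`, etc.), and the single-layer layout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution along the frequency axis."""
    pad = len(w) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(w)] @ w for i in range(len(x))])

def conv_gru_step(h, x, p):
    """One convolutional-GRU update; x is the next spectrogram frame."""
    z = sigmoid(conv1d_same(x, p["wz"]) + conv1d_same(h, p["uz"]))  # update gate
    r = sigmoid(conv1d_same(x, p["wr"]) + conv1d_same(h, p["ur"]))  # reset gate
    h_new = np.tanh(conv1d_same(x, p["wh"]) + conv1d_same(r * h, p["uh"]))
    return (1 - z) * h + z * h_new

def cross_attention(q, k, v):
    """Scaled dot-product attention: spectral queries attend to prosodic keys."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
T, F, D = 10, 64, 16                   # frames, frequency bins, attention dim (toy sizes)
spec = rng.standard_normal((T, F))     # toy log-mel spectrogram
prosody = rng.standard_normal((T, 2))  # toy pitch + pause-length contour

params = {k: rng.standard_normal(3) * 0.1
          for k in ("wz", "uz", "wr", "ur", "wh", "uh")}
h = np.zeros(F)
hidden = []
for t in range(T):                     # iterate frame-to-frame, optical-flow style
    h = conv_gru_step(h, spec[t], params)
    hidden.append(h)
hidden = np.stack(hidden)              # (T, F) temporal acoustic trajectory

Wq = rng.standard_normal((F, D)) * 0.1  # hypothetical projection matrices
Wk = rng.standard_normal((2, D)) * 0.1
Wv = rng.standard_normal((2, D)) * 0.1
fused = cross_attention(hidden @ Wq, prosody @ Wk, prosody @ Wv)
print(hidden.shape, fused.shape)       # (10, 64) (10, 16)
```

The sketch only shows the data flow (recurrent refinement over frames, then spectral-prosodic alignment); a trained model would learn these parameters end-to-end and feed `fused` to a classification head.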
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Submission Number: 22649