STCCA: Spatial–Temporal Coupled Cross-Attention Through Hierarchical Network for EEG-Based Speech Recognition
Abstract: Speech recognition based on electroencephalogram (EEG) signals has attracted considerable attention due to its potential in communication and rehabilitation. Existing methods typically process spatial and temporal features with sequential, parallel, or constrained feature fusion strategies; however, the intricate cross-relationships between spatial and temporal features remain underexplored. To address this limitation, we propose a spatial–temporal coupled cross-attention mechanism realized through a hierarchical network, named STCCA. The proposed STCCA consists of three key components: a local feature extraction module (LFEM), a coupled cross-attention (CCA) fusion module, and a global feature extraction module (GFEM). The LFEM employs CNNs to extract local temporal and spatial features, while the CCA fusion module leverages a dual-directional attention mechanism to establish deep interactions between temporal and spatial features. The GFEM uses multi-head self-attention layers to model long-range dependencies and extract global features. STCCA is validated on three EEG-based speech datasets, achieving accuracies of 45.45%, 25.91%, and 29.07%, corresponding to improvements of 1.95%, 3.98%, and 1.98% over the comparison models.
DOI: 10.3390/s25216541
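The abstract does not include the authors' implementation, but the dual-directional (coupled) cross-attention it describes can be illustrated with a minimal PyTorch sketch: the temporal stream attends to the spatial stream and vice versa, so each stream is enriched with context from the other before downstream fusion. The class name `CoupledCrossAttention`, the tensor shapes, and the residual/LayerNorm fusion rule below are illustrative assumptions, not the paper's actual design.

```python
# A minimal sketch of dual-directional cross-attention between temporal and
# spatial EEG feature streams. All names, dimensions, and the fusion rule are
# assumptions for illustration; they are not taken from the STCCA paper.
import torch
import torch.nn as nn


class CoupledCrossAttention(nn.Module):
    """Bidirectional cross-attention between two feature streams."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Temporal stream queries the spatial stream...
        self.t2s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ...and the spatial stream queries the temporal stream.
        self.s2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, temporal: torch.Tensor, spatial: torch.Tensor):
        # temporal: (batch, T, dim) time-step tokens
        # spatial:  (batch, C, dim) channel tokens
        t_ctx, _ = self.t2s(query=temporal, key=spatial, value=spatial)
        s_ctx, _ = self.s2t(query=spatial, key=temporal, value=temporal)
        # Residual connections preserve each stream's own features
        # while mixing in cross-stream context.
        temporal = self.norm_t(temporal + t_ctx)
        spatial = self.norm_s(spatial + s_ctx)
        return temporal, spatial


if __name__ == "__main__":
    cca = CoupledCrossAttention(dim=64, num_heads=4)
    t_feats = torch.randn(8, 128, 64)  # e.g. 128 time steps per trial
    s_feats = torch.randn(8, 32, 64)   # e.g. 32 EEG channels
    t_out, s_out = cca(t_feats, s_feats)
    print(t_out.shape, s_out.shape)    # (8, 128, 64) (8, 32, 64)
```

In this reading, the two attention passes run in parallel over the same pair of streams, which is what distinguishes a coupled (dual-directional) design from a sequential one where one stream is fused into the other only once.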