RecognAVSE-V2: An Improved Cross-Attention-Based Audio-Visual Speech Enhancement Approach

Published: 08 Mar 2026, Last Modified: 08 Mar 2026
ICCSIC 2026 Oral
License: CC BY 4.0
Track: Track 2: Multimodal and Cross-modal Intelligence
Keywords: Audio-Visual Speech Enhancement, Cross-Attention Mechanism, Feature Fusion, Lip-Speech Synchronization, Audio-Video Temporal Alignment
TL;DR: This paper proposes RecognAVSE-V2, an improved version of the RecognAVSE audio-visual speech enhancement method, which implements a Time-Synced Cross-Attention Mechanism to exploit temporal correlations between audio and video features.
Abstract: Audio-visual speech enhancement (AVSE) is a subfield of machine learning that aims to improve speech quality from noisy audio signals by leveraging visual cues to guide information flow and enhance learning. In this context, the COG-MHEAR program supports a competition, the AVSE Challenge (AVSEC), to advance the development of new AVSE solutions and methods while fostering the community. This paper introduces RecognAVSE-V2, an improved and more efficient version of the RecognAVSE method proposed at AVSEC 2024; it implements a Time-Synced Cross-Attention Mechanism to exploit temporal correlations between audio and video features. The architecture comprises a video encoder that extracts spatiotemporal features from raw video frames, an STFT-based audio encoder that captures spectral features of the audio signal, and a time-synchronized cross-attention module that aligns the audio and video features. Experimental results on the AVSE Challenge and CHiME3 datasets show that RecognAVSE-V2 outperforms both the baseline and its predecessor in some cases while being substantially less complex: the model is $20\times$ smaller, inference is $26\%$ faster, and training takes a quarter of the time.
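The page does not include an implementation, but the fusion step described in the abstract can be sketched roughly as follows. In this minimal PyTorch sketch (all module names, dimensions, and design details are illustrative assumptions, not taken from the paper), STFT audio frames act as attention queries over video features that have first been resampled to the audio frame rate, so each audio frame attends to temporally co-located visual context:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeSyncedCrossAttention(nn.Module):
    """Hypothetical sketch of time-synchronized audio-visual fusion:
    audio frames (queries) attend to video frames (keys/values) after
    the video sequence is upsampled to the audio frame rate."""

    def __init__(self, audio_dim=257, video_dim=512, embed_dim=256, num_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out = nn.Linear(embed_dim, audio_dim)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T_a, audio_dim) -- STFT magnitude frames
        # video_feats: (B, T_v, video_dim) -- per-frame visual embeddings
        B, T_a, _ = audio_feats.shape
        # Linearly interpolate video features along time so T_v -> T_a,
        # enforcing the temporal alignment before attention is applied
        v = F.interpolate(video_feats.transpose(1, 2), size=T_a,
                          mode="linear", align_corners=False).transpose(1, 2)
        q = self.audio_proj(audio_feats)
        kv = self.video_proj(v)
        fused, _ = self.attn(q, kv, kv)
        # Residual connection back into the audio feature space
        return audio_feats + self.out(fused)

# Usage: fuse 100 audio frames with 25 video frames per clip
fusion = TimeSyncedCrossAttention()
enhanced = fusion(torch.randn(2, 100, 257), torch.randn(2, 25, 512))
```

A block like this would sit between the two encoders and a decoder; the linear time-interpolation is just one simple way to reconcile a low video frame rate (e.g., 25 fps) with the much higher STFT frame rate before computing attention.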
Submission Number: 21