AV(SE)²: Audio-Visual Squeeze-Excite Speech Enhancement

Michael L. Iuzzolino, Kazuhito Koishida

ICASSP 2020 (modified: 15 Sept 2021)

Abstract: The goal of audio-visual speech enhancement (AVSE) is to supplement audio-only information with visual information, such as the target speaker's lip movements, to improve the intelligibility and overall perceptual quality of noisy speech signals. We propose a new mechanism for audio-visual (AV) fusion that leverages a cross-modal squeeze-excitation (SE) block for speech enhancement: AV(SE)². The fusion block can be attached to any feature layer of the audio and visual networks and, compared with the standard AV fusion method of channel-wise concatenation, significantly reduces the number of model parameters without loss of performance. We show that AV(SE)² with time-based gating across multiple feature layers outperforms baseline methods that use single-point, channel-wise concatenated AV fusion on objective evaluations.
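The abstract does not give the internals of the fusion block, so the following is only a minimal sketch of how a cross-modal squeeze-excitation gate with per-frame (time-based) gating might look, not the authors' implementation. Everything in it is an assumption made for illustration: the function name `cross_modal_se`, the tensor layouts, the reduction ratio `R`, the ReLU bottleneck, and the direction of gating (visual features gating audio channels). It uses plain Python lists to stay dependency-free.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weights, bias):
    # weights holds one row per output unit; returns weights @ vec + bias.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def squeeze(frame):
    # frame: [C_v][P] -> [C_v]; global average pool over spatial positions.
    return [sum(channel) / len(channel) for channel in frame]


def cross_modal_se(audio, visual, w1, b1, w2, b2):
    """Hypothetical cross-modal SE gate (illustrative assumption).

    audio:  [T][C_a][F]  audio feature maps per time frame
    visual: [T][C_v][P]  visual feature maps per time frame
    Returns the audio features rescaled channel-wise, one gate vector per frame.
    """
    gated = []
    for a_t, v_t in zip(audio, visual):
        z = squeeze(v_t)                                # squeeze: [C_v]
        h = [max(0.0, x) for x in linear(z, w1, b1)]    # bottleneck + ReLU
        g = [sigmoid(x) for x in linear(h, w2, b2)]     # excite: [C_a] in (0, 1)
        gated.append([[g_c * x for x in channel]
                      for g_c, channel in zip(g, a_t)])
    return gated


def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen arbitrarily for the example.
rng = random.Random(0)
T, C_a, F, C_v, P, R = 4, 6, 5, 8, 9, 2
audio = [[[rng.uniform(-1, 1) for _ in range(F)] for _ in range(C_a)]
         for _ in range(T)]
visual = [[[rng.uniform(-1, 1) for _ in range(P)] for _ in range(C_v)]
          for _ in range(T)]
w1, b1 = init(C_v // R, C_v, rng), [0.0] * (C_v // R)
w2, b2 = init(C_a, C_v // R, rng), [0.0] * C_a

gated = cross_modal_se(audio, visual, w1, b1, w2, b2)
```

In this sketch the fusion adds only the two small projection matrices, `C_v * (C_v // R) + (C_v // R) * C_a` weights, and leaves the audio feature shape unchanged, so the same gate could in principle be attached at several layers. Whether this matches the parameter accounting in the paper is not something the abstract confirms.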
