Keywords: Speech Enhancement, Audio-Visual, Contextual Modeling
TL;DR: This paper introduces a novel framework, Cross-Modal Contextual Modeling (CM$^2$), which integrates semantic and signal-level contextual information across audio and visual modalities to enhance speech quality in noisy environments.
Abstract: Audio-Visual Speech Enhancement (AVSE) aims to improve speech quality in noisy environments by utilizing synchronized audio and visual cues.
In real-world scenarios, noise is often non-stationary, interfering with speech signals at varying intensities over time.
Despite these fluctuations, humans can discern and understand masked spoken words as if they were clear.
This capability stems from the auditory system's ability to perceptually reconstruct interrupted speech using visual cues and semantic context in noisy environments, a process known as phonemic restoration.
Inspired by this phenomenon, we propose Cross-Modal Contextual Modeling (CM$^2$), integrating contextual information across different modalities and levels to enhance speech quality.
Specifically, we target two types of contextual information: semantic-level context and signal-level context.
Semantic-level context enables the model to infer missing or corrupted content by leveraging semantic consistency across segments.
Signal-level context further exploits coherence within the signals, building on this semantic consistency.
Additionally, we highlight the role of visual appearance in modeling the frequency-domain characteristics of speech, which further refines and enriches these contextual representations.
Guided by this understanding, we introduce a Semantic Context Module (SeCM) at the very beginning of our framework to capture the initial semantic contextual information from both audio and visual modalities.
Next, we propose a Signal Context Module (SiCM) to obtain signal-level contextual information from both the raw noisy audio signal and the previously acquired audio-visual semantic-level context.
Building on this rich contextual information, we finally introduce a Cross-Context Fusion Module (CCFM) that performs fine-grained fusion across modalities and context types to drive the final speech enhancement.
Comprehensive evaluations across various datasets demonstrate that our method significantly outperforms current state-of-the-art approaches, particularly in low signal-to-noise ratio (SNR) environments.
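To make the described SeCM → SiCM → CCFM pipeline more concrete, below is a minimal, hypothetical PyTorch sketch of how the three modules could be wired together. The module internals, dimensions, and names (e.g., the cross-attention fusion and the spectral mask head) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SeCM(nn.Module):
    """Hypothetical Semantic Context Module: fuses audio and visual streams
    into a shared semantic-level context (illustrative sketch only)."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feat, visual_feat):
        # Audio queries attend to visual keys/values to form semantic context.
        ctx, _ = self.cross_attn(audio_feat, visual_feat, visual_feat)
        return self.norm(audio_feat + ctx)

class SiCM(nn.Module):
    """Hypothetical Signal Context Module: derives signal-level context from
    noisy audio features conditioned on the semantic context."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.temporal = nn.GRU(dim, dim, batch_first=True)

    def forward(self, noisy_feat, semantic_ctx):
        fused = self.proj(torch.cat([noisy_feat, semantic_ctx], dim=-1))
        signal_ctx, _ = self.temporal(fused)
        return signal_ctx

class CCFM(nn.Module):
    """Hypothetical Cross-Context Fusion Module: merges semantic- and
    signal-level contexts and predicts a spectral enhancement mask."""
    def __init__(self, dim=256, n_freq=257):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(dim, n_freq), nn.Sigmoid())

    def forward(self, semantic_ctx, signal_ctx):
        fused, _ = self.fuse(signal_ctx, semantic_ctx, semantic_ctx)
        return self.mask_head(fused)  # (batch, time, n_freq) mask

# Toy forward pass with dummy features: batch of 2, 100 frames, dim 256.
audio = torch.randn(2, 100, 256)
visual = torch.randn(2, 100, 256)
secm, sicm, ccfm = SeCM(), SiCM(), CCFM()
semantic_ctx = secm(audio, visual)
signal_ctx = sicm(audio, semantic_ctx)
mask = ccfm(semantic_ctx, signal_ctx)
print(mask.shape)  # torch.Size([2, 100, 257])
```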
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4079