Abstract: In the burgeoning field of Audio-Visual Speech Recognition (AVSR), existing research has predominantly concentrated on training paradigms tailored for high-quality resources. However, owing to the challenges inherent in real-world data collection, audio-visual data are frequently affected by modality distortion, which encompasses audio-visual asynchrony, video noise, and audio noise. The recognition accuracy of existing AVSR methods is significantly compromised when multiple modality distortions coexist in low-resource data. In light of these challenges, we propose PCD: Cluster-Prompt with Contrastive Decomposition, a robust framework for modality-distortion speech recognition, specifically devised to transfer pre-trained knowledge from a high-resource domain to the target domain by leveraging contrast-augmented prompts. In contrast to previous studies, we account for the possibility of various types of distortion in both the audio and visual modalities. Concretely, we design bespoke prompts to characterize each modality distortion, guiding the model to perform speech recognition across diverse distortion scenarios with very few learnable parameters. To realize the prompt mechanism, we employ multiple cluster-based strategies that better suit the pre-trained audio-visual model. Additionally, we design a contrastive decomposition mechanism to constrain the explicit relationships among various modality conditions, given their shared task knowledge and disparate modality priors. Extensive experiments on the LRS2 dataset demonstrate that PCD achieves state-of-the-art performance for audio-visual speech recognition under the constraints of distorted resources.
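To make the contrastive decomposition idea concrete, the sketch below shows one plausible reading of it: each modality-distortion condition gets a learnable prompt, each prompt is split into a shared component (the common task knowledge) and a residual (the condition-specific modality prior), and an InfoNCE-style loss keeps the residuals of different conditions apart. This is a hypothetical illustration only; the function name, decomposition-by-mean, and temperature are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def prompt_contrastive_loss(prompts: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Illustrative contrastive objective over per-condition prompt embeddings.

    prompts: (C, D) tensor with one learnable prompt per modality-distortion
    condition (e.g. asynchrony, video noise, audio noise). The decomposition
    here is a simple assumption: the mean over conditions stands in for the
    shared task knowledge, and the residuals for the disparate modality priors.
    """
    shared = prompts.mean(dim=0, keepdim=True)        # shared task-knowledge component
    residual = F.normalize(prompts - shared, dim=-1)  # condition-specific components
    sim = residual @ residual.t() / temperature       # (C, C) cosine similarities
    # InfoNCE-style: each residual should match itself, not other conditions,
    # so condition-specific priors stay mutually distinguishable.
    labels = torch.arange(prompts.size(0))
    return F.cross_entropy(sim, labels)
```

In this reading, the shared component is free to carry whatever all conditions have in common, while the loss only acts on the residuals, which matches the abstract's distinction between shared task knowledge and disparate modality priors.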
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: In this paper, we focus primarily on the audio-visual speech recognition (AVSR) task, which aims to translate synchronized audio and video into the corresponding text. This task is currently a focal point in the multimodal domain. Concretely, we propose a novel framework, PCD, which is the first work dedicated to enhancing robustness in modality-distortion AVSR. We design two cluster-based strategies tailored for implementing the prompt mechanism, optimized especially to complement pre-trained audio-visual models. Furthermore, we introduce a novel contrastive decomposition mechanism for prompts, aiming to mine the interactions between diverse modality-distortion conditions. Our proposed method, PCD, achieves state-of-the-art performance on the LRS2 dataset, demonstrating its efficacy in AVSR tasks involving modality distortion.
Supplementary Material: zip
Submission Number: 3503