AMG-AVSR: Adaptive Modality Guidance for Audio-Visual Speech Recognition via Progressive Feature Enhancement

Published: 05 Sept 2024 · Last Modified: 16 Oct 2024 · ACML 2024 Conference Track · License: CC BY 4.0
Keywords: AV-HuBERT, AVSR, Compression and recovery, Curriculum learning
Abstract: Audio-Visual Speech Recognition (AVSR) identifies spoken words by analyzing both lip movements and auditory signals. Compared to Automatic Speech Recognition (ASR), AVSR is more robust in noisy environments because it draws on two modalities. However, the inherent differences between these modalities pose a challenge: accounting for their disparities while leveraging their complementary information for recognition. To address this, we propose AMG-AVSR, a model that employs a two-stage curriculum learning strategy and a feature compression and recovery mechanism. By using the characteristics of each modality in different scenarios to guide the other, the model extracts refined features from audio-visual data, improving recognition performance in both clean and noisy environments. Compared to the baseline model AV-HuBERT, AMG-AVSR performs better on the LRS2 dataset under both noisy and clean conditions, achieving a word error rate (WER) of 2.9% on clean speech and significantly reducing WER relative to previous methods under various noise conditions.
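The abstract reports results as word error rate (WER), the standard AVSR metric. For reference, a minimal sketch of how WER is typically computed (word-level Levenshtein distance normalized by reference length; the function name and interface here are illustrative, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the cat sit")` gives 1/3: one substitution out of three reference words.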
Primary Area: Deep Learning (architectures, deep reinforcement learning, generative models, deep learning theory, etc.)
Student Author: Yes
Submission Number: 284