A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Published: 2024, Last Modified: 29 Jan 2026, CVPR 2024, CC BY-SA 4.0
Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the common dropout techniques to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this study, we delve into this contrasting phenomenon through the lens of modality bias and uncover that an excessive modality bias towards the audio modality induced by dropout constitutes the fundamental cause. Next, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between the modality bias and the robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality, maintaining performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated through comprehensive experiments on the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAV5R.