Multimodal Integration in Audio-Visual Speech Recognition --- How Far Are We From Human-Level Robustness?
Keywords: Multimodal integration, AVSR, Human performance, Robustness
TL;DR: State-of-the-art audiovisual speech recognition models are far less robust than humans to visual noise
Abstract: This paper introduces a novel evaluation framework, inspired by methods from human psychophysics, to systematically assess the robustness of multimodal integration in audiovisual speech recognition (AVSR) models relative to human abilities. We present preliminary results on AV-HuBERT suggesting that multimodal integration in state-of-the-art (SOTA) AVSR models remains mediocre compared to human performance, and we discuss avenues for improvement.
Submission Number: 86