Multimodal Integration in Audio-Visual Speech Recognition --- How Far Are We From Human-Level Robustness?

Published: 10 Oct 2024, Last Modified: 01 Nov 2024
Venue: NeurIPS 2024 Workshop on Behavioral ML
License: CC BY 4.0
Keywords: Multimodal integration, AVSR, Human performance, robustness
TL;DR: State-of-the-art audiovisual speech recognition models are much less robust than humans to visual noise.
Abstract: This paper introduces a novel evaluation framework, inspired by methods from human psychophysics, to systematically assess the robustness of multimodal integration in audiovisual speech recognition (AVSR) models relative to human abilities. We present preliminary results on AV-HuBERT suggesting that multimodal integration in state-of-the-art (SOTA) AVSR models remains mediocre compared to human performance, and we discuss avenues for improvement.
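To make the psychophysics-inspired evaluation concrete, here is a minimal sketch of one way such a robustness sweep could be run: degrade the visual stream at increasing noise levels, transcribe with an AVSR model, and track word error rate (WER) for comparison against human performance curves. This is an illustrative assumption, not the paper's actual protocol or AV-HuBERT's API; `model.transcribe` and the Gaussian pixel-noise degradation are hypothetical placeholders, while `jiwer` is a real WER library.

```python
# Hypothetical sketch of a psychophysics-style robustness sweep.
# `model.transcribe` and the Gaussian degradation are illustrative
# assumptions, not the paper's code or the AV-HuBERT API.
import numpy as np
import jiwer  # real WER library: pip install jiwer


def add_visual_noise(video: np.ndarray, level: float) -> np.ndarray:
    """Add Gaussian pixel noise to video frames (one possible degradation)."""
    noisy = video + np.random.normal(0.0, level, size=video.shape)
    return np.clip(noisy, 0.0, 1.0)


def robustness_curve(model, dataset, noise_levels):
    """WER as a function of visual noise level, analogous to a human
    psychometric curve measured under the same degradations."""
    curve = {}
    for level in noise_levels:
        refs, hyps = [], []
        for video, audio, transcript in dataset:
            hyps.append(model.transcribe(add_visual_noise(video, level), audio))
            refs.append(transcript)
        curve[level] = jiwer.wer(refs, hyps)
    return curve


# Example usage: sweep from clean video (0.0) to heavy noise and compare
# the resulting model curve against a human baseline measured identically.
# curve = robustness_curve(model, dataset, noise_levels=[0.0, 0.1, 0.2, 0.4])
```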
Submission Number: 86
