On Robustness to Missing Video for Audiovisual Speech Recognition

Oscar Chang; Otavio Braga; Hank Liao; Dmitriy Serdyuk; Olivier Siohan

Published: 11 Aug 2022, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g., because the speaker moves off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model below that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures across a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing robustness techniques such as dropout fall short.

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: We made changes requested by the reviewers and the action editors.

Assigned Action Editor: ~Xu_Tan1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 141
