Audio-Visual Speech Recognition In-The-Wild: Multi-Angle Vehicle Cabin Corpus and Attention-Based Method

Published: 01 Jan 2024 · Last Modified: 28 Mar 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Audio-visual speech recognition (AVSR) has gained increasing attention as an important component of human-machine interaction. However, publicly available corpora are limited, particularly for driving conditions with prevalent background noise. Data collected so far come from constrained environments and thus cannot reflect the true performance of AVSR systems in real-world scenarios. Moreover, data for languages other than English are often unavailable. To meet the need for research on AVSR under unconstrained driving conditions, this paper presents a corpus collected 'in-the-wild'. We also propose a cross-modal attention method that enhances multi-angle AVSR for vehicles, leveraging visual context to improve accuracy and noise robustness. The proposed model achieves state-of-the-art (SOTA) results with 98.65% accuracy in recognizing driver voice commands. For more details, visit our project page.
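To make the cross-modal attention idea concrete, the sketch below shows one common way such fusion can be realized: acoustic frame embeddings act as queries that attend to visual embeddings pooled from several camera angles. This is an illustrative assumption for exposition only; the module name, dimensions, and fusion details are hypothetical and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Illustrative sketch of audio-to-visual cross-modal attention.
    Audio features serve as queries; multi-angle visual features serve
    as keys/values. All names and shapes here are assumptions, not the
    authors' architecture."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_a, dim)      acoustic frame embeddings
        # visual: (batch, V * T_v, dim)  lip/face embeddings from V camera
        #                                angles, concatenated along time
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        # Residual connection keeps the acoustic stream dominant while
        # letting visual context refine noisy frames.
        return self.norm(audio + fused)

# Toy usage: 2 utterances, 100 audio frames, 3 camera angles x 25 video frames.
audio = torch.randn(2, 100, 256)
visual = torch.randn(2, 3 * 25, 256)
out = CrossModalAttentionFusion()(audio, visual)
print(out.shape)  # torch.Size([2, 100, 256])
```

A design note on this sketch: querying from the audio side means the output stays aligned with the acoustic frame rate, so the fused features can feed a standard speech-recognition decoder unchanged, while degraded audio frames can borrow evidence from whichever camera angle sees the lips best.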