Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings
Submission Type: Regular Short Paper
Submission Track: Speech and Multimodality
Submission Track 2: Multilinguality and Linguistic Diversity
Keywords: audio-visual speech recognition, speech recognition, multimodal, multilingual
TL;DR: Design a multilingual audio-visual speech recognition system with a single trained model
Abstract: We present a novel approach to multilingual audio-visual speech recognition that trains a single model on a multilingual dataset. Motivated by human cognition, in which people intuitively distinguish different languages without conscious effort or guidance, we propose a model that identifies the language of the input speech by capturing the inherent similarities and differences between languages. To do so, we apply a prompt fine-tuning technique to a large pre-trained audio-visual representation model so that the network recognizes the language class as well as the speech in the corresponding language. Our work contributes to developing robust and efficient multilingual audio-visual speech recognition systems, reducing the need for language-specific models.
Submission Number: 138