Abstract: Visual Speech Recognition (lip-reading) has witnessed tremendous improvements, reaching a word error rate (WER) as low as 12.8% in English. However, performance in other languages lags far behind, due to the lack of labeled multilingual video data. In this work, we reduce the performance gap with the help of three key advances: (i) introducing the largest multilingual lip-reading dataset to date, (ii) proposing a single multi-task architecture that performs two tasks simultaneously: identifying the language and transcribing the utterance, and (iii) jointly training this architecture on all the languages together, resulting in large WER improvements over training monolingual models separately. We achieve state-of-the-art performance on both visual language identification and multilingual lip-reading tasks. Moreover, our pipeline uses zero manual annotations, as all the training transcriptions are obtained using a pre-trained ASR model. We also show that our multilingual model can be readily fine-tuned for new low-resource languages on which models trained from scratch do not converge. Our data, code, and models are available at: www.robots.ox.ac.uk/~vgg/research/multivsr
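A minimal sketch of the multi-task setup the abstract describes: a shared visual encoder whose output feeds both a language-identification head and a transcription head, trained jointly across languages. This is an illustrative outline under assumed choices (a Transformer encoder, a CTC-style transcription head, and all names and sizes such as MultiTaskVSR, feat_dim, and num_languages), not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiTaskVSR(nn.Module):
    """Shared encoder with two heads: language ID and transcription (sketch)."""

    def __init__(self, feat_dim=512, num_languages=9, vocab_size=4000):
        super().__init__()
        # Shared encoder over per-frame lip features (assumed precomputed).
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Task 1: visual language identification from temporally pooled features.
        self.lang_head = nn.Linear(feat_dim, num_languages)
        # Task 2: per-frame token classifier for transcription (CTC-style).
        self.ctc_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, lip_feats):
        # lip_feats: (batch, time, feat_dim)
        h = self.encoder(lip_feats)
        lang_logits = self.lang_head(h.mean(dim=1))   # (batch, num_languages)
        token_logits = self.ctc_head(h)               # (batch, time, vocab_size)
        return lang_logits, token_logits


# Joint training mixes all languages in each batch and sums the two task losses;
# the transcription loss (e.g. CTC or attention-based cross-entropy) would be
# added analogously to the language-ID loss shown here.
model = MultiTaskVSR()
feats = torch.randn(2, 75, 512)                        # 2 clips, 75 frames each
lang_logits, token_logits = model(feats)
lang_loss = nn.functional.cross_entropy(lang_logits, torch.tensor([0, 3]))
```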