Keywords: Visual Speech Recognition, Multimodal Learning, Large Language Model
Abstract: Visual Speech Recognition (VSR) aims to infer what was said by analyzing the speaker's facial dynamics. However, is reliance solely on visual information sufficient in challenging real-world scenarios? In human visual perception, peripheral vision refers to the non-central areas of the visual field, which provide overall awareness that complements the detailed perception of central objects. Similarly, human lip-readers do not rely exclusively on lip movements but integrate contextual cues and prior knowledge to achieve more accurate transcription. For the first time in machine lip-reading, we frame these non-lip-movement factors into a new concept of semantic-level peripheral information. Specifically, we select three representative types varying in relevance to the spoken content: (1) Contextual peripheral information, such as the general topic or some basic knowledge of the speech, can significantly narrow the range of potential recognition hypotheses. (2) Experiential peripheral information emerges from the recognition process itself: the very act of recognizing speech in a specific language provides implicit knowledge of grammar, word collocations, and related linguistic aspects, thereby guiding the recognition effectively. (3) Perturbative peripheral information introduces disturbance factors into the recognition process, analogous to noise injection in visual tasks. Semantic-level peripheral information is only indirectly linked to the transcripts; thus, fusing it into VSR requires strong contextual understanding and inference capabilities. We therefore propose a multimodal learning framework built on a large language model (LLM), leveraging its powerful contextual modeling capabilities to exploit peripheral information. Our method's efficacy is demonstrated on two popular datasets. On the widely used LRS3 dataset, we achieve a Word Error Rate (WER) of 24.5% with readily available peripheral information, an impressive 14.3% relative improvement over the model without such information. To the best of our knowledge, our work sets a new state of the art among methods trained on a comparable amount of lip-reading video. We further report evaluations on the more challenging AVSpeech dataset. Results across both datasets and various experimental settings demonstrate the promising potential of the proposed semantic-level peripheral information for VSR.
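The abstract does not specify how peripheral information is fed to the LLM. As one plausible illustration only, the minimal PyTorch sketch below prepends tokenized peripheral information (e.g., a topic hint) to projected visual lip-movement features before a decoder-style language model. All class names, dimensions, and the stand-in Transformer backbone are hypothetical assumptions made for this sketch, not the authors' actual architecture.

```python
# Hypothetical sketch: fusing semantic-level peripheral information with visual
# speech features via a language-model-style backbone. Names and dimensions are
# illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn


class PeripheralFusionVSR(nn.Module):
    def __init__(self, visual_dim=512, llm_dim=768, vocab_size=32000,
                 n_layers=2, n_heads=8):
        super().__init__()
        # Project visual (lip-movement) features into the LLM embedding space.
        self.visual_proj = nn.Linear(visual_dim, llm_dim)
        # Token embeddings for the peripheral-information prompt.
        self.token_emb = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for a pretrained LLM; a small Transformer keeps the sketch
        # self-contained and runnable.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, visual_feats, peripheral_ids):
        # visual_feats:   (B, T_v, visual_dim) lip-movement features
        # peripheral_ids: (B, T_p) tokenized peripheral information (e.g., topic)
        v = self.visual_proj(visual_feats)           # (B, T_v, llm_dim)
        p = self.token_emb(peripheral_ids)           # (B, T_p, llm_dim)
        fused = torch.cat([p, v], dim=1)             # prepend peripheral prompt
        h = self.backbone(fused)
        return self.lm_head(h)                       # logits over the vocabulary


if __name__ == "__main__":
    model = PeripheralFusionVSR()
    visual = torch.randn(2, 75, 512)                 # ~3 s of video features
    peripheral = torch.randint(0, 32000, (2, 16))    # tokenized topic hint
    logits = model(visual, peripheral)
    print(logits.shape)                              # torch.Size([2, 91, 32000])
```

In this reading, contextual or experiential peripheral information is simply additional conditioning text whose embeddings share the sequence with the visual features, so the language model's contextual modeling can constrain the transcription hypotheses.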
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13173