Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information
Abstract: Is visual information alone sufficient for visual speech
recognition (VSR) in challenging real-world scenarios?
Humans do not rely solely on visual information for lipreading; they also incorporate additional cues, such as speech-related context and prior knowledge about the task.
However, existing methods have largely overlooked such external information in automatic VSR systems. To systematically explore the role of such information for VSR, we
introduce the concept of Peripheral Information. We categorize it into three types based on its relevance to the
spoken content: (1) Contextual Guidance (e.g., topic or description of speech), (2) Task Expertise (e.g., human prior
experience in lipreading), and (3) Linguistic Perturbation
(irrelevant signals processed alongside meaningful information). Since peripheral information supplies auxiliary clues of varying significance, while visual input remains the most direct source for VSR,
we propose a framework with a hierarchical processing strategy for handling the different modalities. Through visual-specific adaptation and a dynamic routing mechanism for
multi-modal information, our approach effectively mitigates
modality conflicts and selectively exploits peripheral information of varying relevance.
Leveraging readily available peripheral information, our
model achieves a WER of 22.03% on LRS3. Further experiments on AVSpeech demonstrate its generalization in real-world scenarios.
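The dynamic routing idea described above can be illustrated with a minimal sketch. Everything here (the similarity-based relevance scoring, the residual fusion, and the name `route_peripheral`) is a hypothetical construction for intuition, not the paper's actual implementation:

```python
import numpy as np

def route_peripheral(visual, peripherals, temperature=1.0):
    """Weight each peripheral cue by its similarity to the visual
    features, so less relevant cues (e.g. linguistic perturbation)
    receive smaller routing weights. Hypothetical illustration only."""
    d = visual.shape[0]
    # relevance score = scaled dot product against the visual anchor
    scores = peripherals @ visual / (np.sqrt(d) * temperature)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax routing weights
    context = weights @ peripherals       # relevance-weighted peripheral context
    return visual + context, weights      # residual fusion keeps vision primary

# toy example: one cue aligned with the visual features, one orthogonal
v = np.array([1.0, 0.0, 1.0, 0.0])
cues = np.stack([np.array([1.0, 0.0, 1.0, 0.0]),    # relevant cue
                 np.array([0.0, 1.0, 0.0, -1.0])])  # irrelevant cue
fused, w = route_peripheral(v, cues)
# the aligned cue receives the larger routing weight: w[0] > w[1]
```

Because the visual features enter the fusion directly while peripheral cues pass through the learned gate, an uninformative cue can be suppressed without disturbing the primary visual pathway.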