Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information
Abstract: Is visual information alone sufficient for visual speech
recognition (VSR) in challenging real-world scenarios?
Humans do not rely solely on visual information for lipreading; they also incorporate additional cues, such as speech-related context and prior knowledge about the task.
However, existing methods have largely overlooked such external information in automatic VSR systems. To systematically explore the role of such information for VSR, we
introduce the concept of Peripheral Information. We categorize it into three types based on its relevance to the
spoken content: (1) Contextual Guidance (e.g., topic or description of speech), (2) Task Expertise (e.g., human prior
experience in lipreading), and (3) Linguistic Perturbation
(irrelevant signals processed alongside meaningful information). Since peripheral information supplies auxiliary clues of varying significance, while visual input remains the most direct source for VSR,
we propose a framework with a hierarchical processing strategy for handling the different modalities. Through visual-specific adaptation and a dynamic routing mechanism for
multi-modal information, our approach effectively mitigates
modality conflicts and selectively exploits peripheral information of varying relevance.
Leveraging readily available peripheral information, our
model achieves a WER of 22.03% on LRS3. Further experiments on AVSpeech demonstrate its generalization in real-world scenarios.
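The dynamic routing idea described above can be illustrated with a minimal sketch. Everything here (the similarity-based relevance scoring, the residual fusion, and the name `route_peripheral`) is a hypothetical construction for intuition, not the paper's actual implementation:

```python
import numpy as np

def route_peripheral(visual, peripherals, temperature=1.0):
    """Weight each peripheral cue by its similarity to the visual
    features, so less relevant cues (e.g. linguistic perturbation)
    receive smaller routing weights. Hypothetical illustration only."""
    d = visual.shape[0]
    # relevance score = scaled dot product against the visual anchor
    scores = peripherals @ visual / (np.sqrt(d) * temperature)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax routing weights
    context = weights @ peripherals       # relevance-weighted peripheral context
    return visual + context, weights      # residual fusion keeps vision primary

# toy example: one cue aligned with the visual features, one orthogonal
v = np.array([1.0, 0.0, 1.0, 0.0])
cues = np.stack([np.array([1.0, 0.0, 1.0, 0.0]),    # relevant cue
                 np.array([0.0, 1.0, 0.0, -1.0])])  # irrelevant cue
fused, w = route_peripheral(v, cues)
# the aligned cue receives the larger routing weight: w[0] > w[1]
```

Because the visual features enter the fusion directly while peripheral cues pass through the learned gate, an uninformative cue can be suppressed without disturbing the primary visual pathway.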