Abstract: Visual Speech Recognition (VSR), or lip reading, is essential in scenarios where audio signals are absent or degraded. In this paper, we introduce an end-to-end framework that integrates visual representations into a pretrained Large Language Model (LLM), enabling transcription that leverages multimodal context for improved accuracy and robustness. The core of our method is a cross-modal attention module that establishes fine-grained alignment between audio and visual streams during training, paired with lightweight adapters for seamless multimodal integration. At inference, the model relies solely on visual data, benefiting from audio-guided learning to enhance transcription accuracy. The proposed framework enables robust adaptation across varied linguistic conditions, yielding superior generalization and performance. Our experiments across Latin-script languages demonstrate consistent improvements over the current state of the art, yielding 1.53%–3.83% absolute reductions in WER. Experiments on Romanian, a previously unseen language, reveal strong zero-shot generalization and significant improvements after fine-tuning.
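The abstract does not provide implementation details, so the following is only a minimal illustrative sketch of the kind of component it describes: a cross-modal attention block in which visual tokens attend over audio tokens during training, followed by a lightweight bottleneck adapter, with a visual-only fallback at inference. All names, dimensions, and the use of PyTorch are assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch (not the paper's code): audio-guided cross-modal attention
# with a lightweight adapter. At inference the audio stream may be absent, in
# which case the block operates on the visual features alone.
from typing import Optional

import torch
import torch.nn as nn


class CrossModalAdapterBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, bottleneck: int = 64):
        super().__init__()
        # Visual tokens act as queries; audio tokens act as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Lightweight adapter: down-project, non-linearity, up-project.
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, visual: torch.Tensor, audio: Optional[torch.Tensor] = None):
        # visual: (batch, T_v, dim); audio: (batch, T_a, dim) or None at inference.
        if audio is not None:
            aligned, _ = self.cross_attn(query=visual, key=audio, value=audio)
            visual = self.norm(visual + aligned)  # audio-guided alignment (training)
        return visual + self.adapter(visual)      # residual adapter output


# Toy usage: training sees both streams, inference sees video only.
block = CrossModalAdapterBlock()
v = torch.randn(2, 75, 512)   # e.g. 75 video frames of visual features
a = torch.randn(2, 300, 512)  # e.g. 300 frames of audio features
fused = block(v, a)           # training-time, audio-guided
visual_only = block(v)        # inference-time, visual-only
print(fused.shape, visual_only.shape)
```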
External IDs: dblp:conf/cbmi/TapuM25