Cross-Modal Language Modeling in Multi-Motion-Informed Context for Lip Reading

Published: 01 Jan 2023, Last Modified: 10 Jul 2023. IEEE ACM Trans. Audio Speech Lang. Process., 2023.
Abstract: We observe that for lip reading, language is transformed locally rather than globally, i.e., speaking and writing follow the same basic grammar rules. In this work, we present a cross-modal language model to tackle the lip-reading challenge on silent videos. Compared to previous works, we consider multi-motion-informed contexts, composed of multiple lip-motion representations from different subspaces, to guide decoding via the source-target attention mechanism. We present a piece-wise pre-training strategy inspired by multi-task learning, which pre-trains a visual module to generate multi-motion-informed contexts for cross-modality modeling and a decoder to generate text for language modeling. Our final large-scale model outperforms baseline models on four datasets: LRS2, LRS3, LRW, and GRID. We will release our source code on GitHub.
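
The abstract describes a decoder that is guided by several lip-motion context streams through source-target (cross) attention. Below is a minimal sketch, not the authors' released code, of how such a decoder layer could attend to multiple motion-informed contexts; all module names, dimensions, and the simple concatenation-based fusion are illustrative assumptions.

```python
# Minimal sketch of a decoder layer with source-target attention over
# multiple lip-motion context streams. Names, dimensions, and the
# concatenation-based fusion are assumptions for illustration only.
import torch
import torch.nn as nn


class MultiMotionCrossAttentionDecoderLayer(nn.Module):
    """Self-attention over text tokens, then cross-attention over a
    concatenation of several motion-context sequences."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text_states, motion_contexts):
        # text_states: (batch, T_text, d_model) token representations
        # motion_contexts: list of (batch, T_video, d_model) tensors, one per
        # lip-motion subspace produced by the visual module (assumed shapes)
        x = text_states
        sa, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + sa)

        # Fuse the motion streams along the time axis so the decoder can
        # attend jointly to all motion-informed contexts.
        memory = torch.cat(motion_contexts, dim=1)
        ca, _ = self.cross_attn(x, memory, memory, need_weights=False)
        x = self.norm2(x + ca)

        return self.norm3(x + self.ffn(x))


if __name__ == "__main__":
    layer = MultiMotionCrossAttentionDecoderLayer()
    text = torch.randn(2, 20, 512)                          # 20 text tokens
    contexts = [torch.randn(2, 75, 512) for _ in range(3)]  # 3 motion subspaces
    out = layer(text, contexts)
    print(out.shape)  # torch.Size([2, 20, 512])
```

In this sketch, the piece-wise pre-training described in the abstract would train the visual module (producing `motion_contexts`) and the text decoder separately before joint fine-tuning; how the paper actually fuses the subspaces may differ from the concatenation used here.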