Enhancing Mandarin Lip Reading with a Multimodal Conformer and Structured State-Space Decoder

Meng Miao, Feilong Bao, Guanglai Gao

Published: 2025, Last Modified: 13 Mar 2026. IJCNN 2025. CC BY-SA 4.0
Abstract: Lip reading is a visual recognition technology that interprets spoken content by decoding lip movements. Because speech perception is inherently multimodal, incorporating audio information during training can substantially assist lip reading. This paper proposes a novel architecture that combines the Conformer network with a structured state space decoder under multimodal input to improve Mandarin lip reading. Mandarin is a tonal language, so audio cues can guide the learning of visual information and improve the accuracy of speech content recognition. Our approach uses the Conformer to extract shared semantics from both audio and video, and a bidirectional structured state space decoder to capture the temporal dynamics and complex dependencies of long sequences. The method achieves character error rates (CERs) of 54.97% on the CN-CVS dataset and 12.53% on the CMLR dataset. The research code is open source at: https://anonymous.4open.science/r/Lip-reading-model-D7B8.
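The results in the abstract are reported as CER. As a minimal sketch of how a corpus-level character error rate is commonly computed (total character-level edit distance divided by total reference length), the following may help in reading those numbers. The abstract does not describe the authors' scoring procedure, so this is an assumption about the standard definition, and the function names `edit_distance` and `cer` are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: summed edit distance over summed reference length."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over 6 characters.
    refs = ["今天天气很好"]
    hyps = ["今天天汽很"]
    print(f"CER = {cer(refs, hyps):.2%}")
```

Under this definition, a CER of 12.53% corresponds to roughly one character-level edit for every eight reference characters. For Mandarin, each Chinese character counts as one unit, so no word segmentation is needed before scoring.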