Abstract: Lip reading is a visual recognition technology that interprets spoken content by decoding lip movements. Because speech perception is inherently multimodal, incorporating audio information during training is crucial for improving lip reading. This paper proposes a novel architecture that combines a Conformer network with a structured state space decoder under multimodal input to improve Mandarin lip reading. As a tonal language, Mandarin benefits from audio cues that guide the learning of visual information, improving the accuracy of speech content recognition. Our approach uses the Conformer to extract shared semantics from both audio and video, and a bidirectional structured state space decoder to capture the temporal dynamics and complex dependencies of long sequences. The method achieves character error rates (CERs) of 54.97% and 12.53% on the CN-CVS and CMLR datasets, respectively. The research code is open source at: https://anonymous.4open.science/r/Lip-reading-model-D7B8.
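For readers unfamiliar with the reported metric, character error rate is conventionally defined as the character-level edit distance between the recognized text and the reference transcript, divided by the reference length. The following is a minimal sketch under that standard definition; the function names `edit_distance` and `cer` are illustrative and are not taken from the paper's code, whose evaluation script may normalize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four gives a CER of 0.25.
    print(cer("你好世界", "你号世界"))
```

Operating on characters rather than words suits Mandarin, where text is not whitespace-delimited. Because insertions count as errors, this quantity can exceed 1.0 when the hypothesis is much longer than the reference.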
External IDs: dblp:conf/ijcnn/MiaoBG25