Abstract: Lipreading analyzes the patterns and features of lip and mouth movements in video and converts them into understandable language or text. Correctly interpreting the key frames of a video is crucial to its success. Existing lipreading work has paid little attention to combining key frames with temporal modeling of long-range dependencies, which leads to the misclassification of phonetically similar words. In this work, we explore the integration of key frames into long-range dependency information and propose a joint perception temporal convolutional network (JP-TCN) for word-level lipreading. Specifically, a multi-scale feature extraction module extracts local dynamic information and long-range dependency information from the front-end image features. These two sets of information are then fed into an adaptive fusion module that weights and combines them at each time position, enabling the model to enhance key-frame learning at different levels. Without additional bells and whistles, the proposed JP-TCN achieves a recognition accuracy of 89.92% on the LRW dataset.
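The per-time-position weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the adaptive fusion module, so everything here is an assumption made for illustration: the function name `adaptive_fusion`, the single sigmoid gate per time step, the linear scoring of the concatenated features, and the random inputs are all hypothetical and are not taken from JP-TCN itself.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fusion(local_feats, long_feats, weights, bias):
    """Blend two feature streams with one gate per time step.

    local_feats, long_feats: lists of T feature vectors, each of length C.
    weights: 2*C scoring weights (hypothetical learned parameters).
    bias: scalar offset for the gate score.
    Returns the fused features (T x C) and the gates (length T).
    """
    fused, gates = [], []
    for loc, lng in zip(local_feats, long_feats):
        # Score the concatenated [local, long-range] vector for this frame.
        score = bias + sum(w * v for w, v in zip(weights, loc + lng))
        g = sigmoid(score)
        gates.append(g)
        # Convex combination: g favors local dynamics, 1 - g long-range context.
        fused.append([g * a + (1.0 - g) * b for a, b in zip(loc, lng)])
    return fused, gates


# Illustrative sizes and random inputs only; not real lipreading features.
random.seed(0)
T, C = 5, 4
local_feats = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]
long_feats = [[random.gauss(0, 1) for _ in range(C)] for _ in range(T)]
weights = [random.gauss(0, 0.1) for _ in range(2 * C)]
fused, gates = adaptive_fusion(local_feats, long_feats, weights, 0.0)
```

Under this assumed gating, a frame whose gate is close to 1 is represented mostly by its local dynamics, while a gate close to 0 defers to the long-range context. This is one simple way a model could emphasize key frames differently at each time position.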
External IDs:dblp:conf/msn/LiTCW23