WPELip: enhance lip reading with word-prior information

Published: 01 Jan 2025, Last Modified: 11 Apr 2025. Multim. Syst. 2025. License: CC BY-SA 4.0.
Abstract: The conventional lipreading pipeline typically consists of a frontend visual feature encoder that extracts video features and a backend sequence decoder that decodes these features into text; the whole network is trained only through supervision on the text output. Because of the network's depth, the supervision signal from backend text decoding, that is, the error between the label text and the predicted text, is far removed from the frontend visual feature encoder, which leaves the encoder insufficiently refined and leads to sub-optimal lipreading performance. To address this issue, a novel lipreading strategy with intermediate supervision is proposed. First, a cross-attention module between the video and a dictionary is inserted between the frontend and the backend, and a frame loss is introduced to provide direct supervision signals to the frontend visual feature encoder. The cross-attention mechanism enables this module to extract a set of frame-level word-prior cues from the dictionary for each video frame; it is therefore designated the Word-Prior Enhancement Module. Second, a Temporal-level Feature Fusion Module is proposed to fuse the word-prior cues with the video features while capturing the temporal dependencies between them. Finally, the fused features are fed to the backend sequence decoder. Extensive experiments on the CMLR and GRID datasets show that the proposed method outperforms existing methods in reducing error rates and confirm its effectiveness in improving lipreading performance.
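To make the described architecture concrete, the following is a minimal PyTorch sketch of the two components named in the abstract: a cross-attention block in which frame features query a dictionary of word embeddings (yielding frame-level word-prior cues and an auxiliary frame loss), and a fusion block that combines those cues with the original frame features before the backend decoder. All module names, dimensions, and the convolutional fusion choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class WordPriorEnhancement(nn.Module):
    """Cross-attention between per-frame visual features and a learnable dictionary
    of word embeddings, producing frame-level word-prior cues and frame logits
    used for the auxiliary (intermediate) frame loss. Illustrative sketch only."""

    def __init__(self, feat_dim: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.dictionary = nn.Embedding(vocab_size, feat_dim)  # one embedding per word
        # Queries = frame features, keys/values = dictionary entries.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.frame_classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, time, feat_dim) from the frontend visual encoder.
        b = frame_feats.size(0)
        dict_emb = self.dictionary.weight.unsqueeze(0).expand(b, -1, -1)
        word_prior, _ = self.cross_attn(frame_feats, dict_emb, dict_emb)
        frame_logits = self.frame_classifier(word_prior)  # supervised by the frame loss
        return word_prior, frame_logits


class TemporalFeatureFusion(nn.Module):
    """Fuses word-prior cues with the frame features over time; a single temporal
    convolution stands in for the paper's fusion design (an assumption)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.fuse = nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, frame_feats: torch.Tensor, word_prior: torch.Tensor):
        x = torch.cat([frame_feats, word_prior], dim=-1)   # (B, T, 2D)
        return self.fuse(x.transpose(1, 2)).transpose(1, 2)  # (B, T, D)


if __name__ == "__main__":
    # Toy shapes: 2 clips, 75 frames, 256-dim features, 1000-word dictionary.
    feats = torch.randn(2, 75, 256)
    wpe = WordPriorEnhancement(feat_dim=256, vocab_size=1000)
    tff = TemporalFeatureFusion(feat_dim=256)
    prior, frame_logits = wpe(feats)
    fused = tff(feats, prior)  # fused features would feed the backend decoder
    frame_labels = torch.randint(0, 1000, (2, 75))  # hypothetical per-frame word labels
    frame_loss = nn.functional.cross_entropy(
        frame_logits.reshape(-1, 1000), frame_labels.reshape(-1)
    )
    print(fused.shape, frame_loss.item())
```

In this sketch the frame loss supervises the encoder directly at every time step, while the text-level loss from the backend decoder (not shown) would still supervise the full pipeline end to end.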