LipMVCL: Lipreading Based on Multi-view and Collaborative Learning

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · CCBR (2) 2024 · CC BY-SA 4.0
Abstract: Lipreading aims to decode the corresponding text by analyzing lip movements. Existing lipreading models typically employ 2D or 3D CNNs for feature extraction: 2D CNNs focus on the spatial features of individual frames, while 3D CNNs capture the spatiotemporal features of the video. Leveraging both types of networks to extract features yields a multi-perspective mapping of the visual modality. This paper proposes an end-to-end multi-view collaborative learning lipreading model (LipMVCL), which makes fuller use of visual-modality features and improves the model's predictive ability. LipMVCL comprises two branches, one using 2D CNNs and the other using 3D CNNs for visual feature extraction. During decoding, the model incorporates a collaborative learning module to facilitate information exchange between the two branches: the predictions of each branch are updated using both backend textual supervision and the predictions of the other branch as an additional supervisory signal. Experimental results on the CMLR and GRID datasets demonstrate that our approach outperforms state-of-the-art methods.
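To make the collaborative learning module more concrete, the sketch below shows one plausible way the mutual supervision could be realized: each branch is trained with a standard text-supervision loss plus a KL-divergence term that pulls its predictions toward the other branch's (detached) predictions, in the style of deep mutual learning. This is an illustrative assumption rather than the paper's exact formulation; the function name `collaborative_loss`, the weighting factor `alpha`, and the use of per-token cross-entropy (instead of, e.g., a CTC or attention-based sequence loss) are all hypothetical.

```python
import torch
import torch.nn.functional as F


def collaborative_loss(logits_2d, logits_3d, targets, alpha=1.0):
    """Hypothetical sketch of mutual supervision between the 2D-CNN and
    3D-CNN branches. `logits_*` are per-token prediction logits of shape
    (N, vocab_size) and `targets` are token indices of shape (N,)."""
    # Backend textual supervision for each branch (here simplified to
    # per-token cross-entropy against the ground-truth text).
    ce_2d = F.cross_entropy(logits_2d, targets)
    ce_3d = F.cross_entropy(logits_3d, targets)

    # Soft predictions from each branch.
    logp_2d = F.log_softmax(logits_2d, dim=-1)
    logp_3d = F.log_softmax(logits_3d, dim=-1)

    # Each branch treats the other's (detached) prediction distribution
    # as an additional supervisory signal.
    kl_2d = F.kl_div(logp_2d, logp_3d.detach().exp(), reduction="batchmean")
    kl_3d = F.kl_div(logp_3d, logp_2d.detach().exp(), reduction="batchmean")

    # Total loss: text supervision plus weighted cross-branch agreement.
    return (ce_2d + ce_3d) + alpha * (kl_2d + kl_3d)
```

In such a setup, detaching the peer branch's predictions keeps each KL term acting as a one-way supervisory signal, so the two branches exchange information without either collapsing onto the other.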