The Collaboration of 3D Convolutions and CRO-TSM in Lipreading

Published: 2024, Last Modified: 12 Jun 2025, Venue: ICASSP 2024, License: CC BY-SA 4.0
Abstract: Lip reading is the recognition of speech from the subtle movements of the lips alone, without audio information. Extracting temporal information has long been a challenge in this field. In this work, we propose an effective method for extracting temporal information. Specifically, we make the following contributions. First, we propose cro-TSM, which extends the existing Temporal Shift Module (TSM) by using different channel ratios for temporal shifting. Second, we replace the global average pooling of the ResNet with 3D convolutions, which work together with cro-TSM to extract additional temporal information. Finally, we apply this method to state-of-the-art models and achieve an accuracy of 92.4% on the Lip Reading in the Wild (LRW) dataset. Our approach surpasses all baseline methods and sets a new state of the art in lipreading.
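To make the channel-ratio idea concrete, the following is a minimal NumPy sketch of a plain temporal shift with a configurable ratio. It is not the authors' cro-TSM: the abstract does not say how the different ratios are chosen or where they are applied, so the function name `temporal_shift`, the `shift_ratio` parameter, the `(T, C, H, W)` layout, and the even forward/backward split are all assumptions made for illustration.

```python
import numpy as np


def temporal_shift(x, shift_ratio=0.25):
    """Shift a fraction of the channels along the time axis.

    x           : array of shape (T, C, H, W), one feature map per frame.
    shift_ratio : fraction of channels that are shifted in total
                  (hypothetical parameter); half of them move forward
                  in time and half move backward.
    """
    t, c, h, w = x.shape
    fold = int(c * shift_ratio / 2)  # channels shifted in each direction

    out = np.zeros_like(x)
    # First `fold` channels: frame t receives the features of frame t-1.
    out[1:, :fold] = x[:-1, :fold]
    # Next `fold` channels: frame t receives the features of frame t+1.
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]
    # Remaining channels are left untouched.
    out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


# Usage: the same input shifted with two different ratios.
features = np.random.rand(29, 64, 22, 22)
light = temporal_shift(features, shift_ratio=0.125)
heavy = temporal_shift(features, shift_ratio=0.5)
```

A larger ratio mixes more information between neighbouring frames at the cost of per-frame spatial features, which is the trade-off a ratio-dependent variant of TSM would presumably tune.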