Speaker-Adaptive Lipreading via Spatio-Temporal Information Learning

Published: 13 Apr 2024 · Last Modified: 05 Mar 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Lipreading has developed rapidly in recent years with the help of large-scale datasets and large models. Despite this significant progress, lipreading models still perform poorly on unseen speakers. It is therefore necessary to fine-tune on a speaker's videos to obtain a speaker-adaptive model; however, this can incur high overheads, especially with full fine-tuning. To address this problem, we propose a novel parameter-efficient fine-tuning method based on spatio-temporal information learning. In our approach, a low-rank adaptation module that influences global spatial features and a plug-and-play temporal adaptive weight learning module are placed in the front-end and back-end networks, adapting to a speaker's unique characteristics such as lip shape and speaking style, respectively. An Adapter module is added between them to further enhance spatio-temporal learning. Experiments on the LRW-ID and GRID datasets demonstrate that our method achieves state-of-the-art performance with fewer parameters.
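The low-rank adaptation idea underlying the front-end module can be sketched as follows. This is a minimal, generic LoRA-style linear layer in NumPy, not the paper's actual implementation: the class name `LoRALinear`, the shapes, and the zero-initialization of the up-projection are standard LoRA conventions assumed here for illustration. Only the small matrices `A` and `B` would be updated during speaker adaptation, while the base weight stays frozen.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update:
    y = x W^T + (alpha / r) * x (B A)^T
    """

    def __init__(self, in_features, out_features, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight (pretrained in practice; random here for the sketch).
        self.W = rng.standard_normal((out_features, in_features)) * 0.02
        # Trainable down-projection A and up-projection B.
        # B starts at zero, so the adapter initially contributes nothing
        # and fine-tuning begins from the base model's behavior.
        self.A = rng.standard_normal((r, in_features)) * 0.01
        self.B = np.zeros((out_features, r))
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus scaled low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)
```

Because only `A` and `B` (of rank `r`, far smaller than the full weight) are tuned per speaker, the storage and compute cost of adaptation stays low, which is the parameter-efficiency argument the abstract makes.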