Capturing Dynamic Identity Features for Speaker-Adaptive Visual Speech Recognition

Published: 01 Jan 2024, Last Modified: 05 Mar 2025, APSIPA 2024, CC BY-SA 4.0
Abstract: This paper describes a multi-task learning method to improve speaker adaptation in visual speech recognition (VSR). VSR models are highly sensitive to variations in lip movements, resulting in degraded accuracy for speakers not seen during training. A typical solution is to fine-tune a pre-trained model with a small amount of data from the target speaker, but this often leads to overfitting to the specific speech content of those samples. Effective speaker adaptation requires VSR models to learn both static and dynamic aspects of individual lip features independently of speech content. However, the dynamic features are time-variant and intertwined with similarly time-variant content information. To address this issue, we introduce an additional task into the fine-tuning process that encourages the model to acquire disentangled dynamic identity features. Specifically, we apply temporal transformations to the latent visual representations and feed them, together with the original ones, into a dynamic identity discriminator. Because both sets of representations share the same static identity features and speech content, training the discriminator to decide whether each entry is original or transformed promotes the desired speaker adaptation. Our evaluation demonstrates that the proposed method improves recognition accuracy for all speakers across two different datasets: public online speech videos and our private recordings captured with a smartphone camera.
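The following PyTorch-style sketch illustrates the general shape of such an auxiliary task: each latent sequence is paired with a temporally transformed copy, and a small discriminator is trained to tell them apart, with its loss added to the main VSR objective. All names, the choice of time reversal as the transformation, the GRU-based discriminator, and the loss weighting are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of an auxiliary dynamic-identity discrimination task,
# assuming PyTorch. Module names, the transformation, and the loss weight
# are hypothetical and not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_transform(latent: torch.Tensor) -> torch.Tensor:
    """Perturb the temporal order of latent visual features.

    latent: (batch, time, dim). Time reversal is used here purely as an
    example of a content-preserving temporal transformation.
    """
    return torch.flip(latent, dims=[1])


class DynamicIdentityDiscriminator(nn.Module):
    """Binary classifier: is a latent sequence original or transformed?"""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(latent)               # h: (1, batch, hidden)
        return self.head(h[-1]).squeeze(-1)   # logits, shape (batch,)


def adaptation_loss(vsr_loss: torch.Tensor,
                    latent: torch.Tensor,
                    discriminator: DynamicIdentityDiscriminator,
                    weight: float = 0.1) -> torch.Tensor:
    """Combine the main VSR loss with the auxiliary discrimination loss."""
    transformed = temporal_transform(latent)
    logits = torch.cat([discriminator(latent), discriminator(transformed)])
    labels = torch.cat([torch.ones(latent.size(0)),
                        torch.zeros(latent.size(0))]).to(logits.device)
    aux = F.binary_cross_entropy_with_logits(logits, labels)
    return vsr_loss + weight * aux
```

In this sketch the two inputs to the discriminator share static identity features and speech content by construction, so separating them forces attention onto the temporal dynamics, which is the intuition the abstract describes.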