Speaker Adaptation for Lip-Reading Using Visual Identity Vectors

Published: 01 Jan 2019, Last Modified: 17 May 2023, INTERSPEECH 2019
Abstract: Visual speech recognition, or lip-reading, suffers from a high word error rate (WER) because it relies solely on articulators that are visible to the camera. Recent works mitigated this problem using complex deep neural network architectures. I-vector-based speaker adaptation is a well-known technique in ASR systems used to reduce WER on unseen speakers. In this work, we explore speaker adaptation of lip-reading models using latent identity vectors (visual i-vectors) obtained by factor analysis on visual features. To estimate the visual i-vectors, we employ two ways of collecting sufficient statistics: first using a GMM-based universal background model (UBM), and second using an RNN-HMM-based UBM. The speaker-specific visual i-vector is given as an additional input to the hidden layers of the lip-reading model during the training and test phases. On the GRID corpus, the use of visual i-vectors yields 15% and 10% relative improvements over current state-of-the-art lip-reading architectures on unseen speakers using the RNN-HMM-based and GMM-based methods, respectively. Furthermore, we explore how WER varies with the dimension of the visual i-vectors and with the amount of unseen-speaker data required for visual i-vector estimation. We also report results on a Korean visual corpus that we created.
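The abstract describes feeding the speaker-specific visual i-vector into the hidden layers of the lip-reading network at both training and test time. The sketch below (PyTorch, not the authors' code) illustrates one way this could look: a fixed-dimensional i-vector is broadcast over time and concatenated with the recurrent hidden activations before a subsequent layer. The layer sizes, the i-vector dimension, and the module and argument names (e.g. `IVectorAdaptedLipReader`, `visual_ivector`) are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class IVectorAdaptedLipReader(nn.Module):
    """Minimal sketch: inject a speaker-level visual i-vector into a hidden layer."""

    def __init__(self, feat_dim=512, ivec_dim=100, hidden_dim=256, num_classes=52):
        super().__init__()
        # Frame-level visual features (e.g. embeddings of mouth-region crops)
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Hidden layer that also receives the speaker-specific visual i-vector
        self.adapt = nn.Linear(2 * hidden_dim + ivec_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual_feats, visual_ivector):
        # visual_feats: (batch, time, feat_dim); visual_ivector: (batch, ivec_dim)
        h, _ = self.encoder(visual_feats)                      # (batch, time, 2*hidden)
        # Repeat the utterance-level i-vector at every time step and concatenate
        ivec = visual_ivector.unsqueeze(1).expand(-1, h.size(1), -1)
        h = torch.relu(self.adapt(torch.cat([h, ivec], dim=-1)))
        return self.classifier(h)                              # per-frame logits


if __name__ == "__main__":
    model = IVectorAdaptedLipReader()
    feats = torch.randn(2, 75, 512)   # e.g. 75 video frames per GRID utterance (assumed)
    ivecs = torch.randn(2, 100)       # one visual i-vector per speaker (dimension assumed)
    print(model(feats, ivecs).shape)  # torch.Size([2, 75, 52])
```

In an actual system the i-vector would come from the GMM- or RNN-HMM-based UBM statistics and a total-variability model rather than random tensors; the point of the sketch is only the concatenation of a speaker code with hidden-layer inputs.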