Boosting Lip Reading with a Multi-View Fusion Network

Published: 01 Jan 2022, Last Modified: 16 Apr 2024, ICME 2022
Abstract: Lip reading aims to decode speech information by analyzing lip movements without involving audio. Numerous deep-learning-based methods have been proposed for this task. However, most existing methods extract visual features based only on lip appearance, ignoring the dynamic shape information of the lip region. Motivated by this, we propose a Multi-View Fusion Network (MVFN), which extracts more discriminative visual representations by incorporating both appearance and shape information. In addition, a novel adaptive graph convolutional network model, called the Adaptive Spatial Graph Model (ASGM), is proposed to automatically learn lip spatial topology and lip shape dynamics. Experiments on LRW (word-level) and OuluVS2 (phrase-level) show that the proposed method outperforms the baseline methods by a large margin and achieves state-of-the-art performance.
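The abstract's key idea, a graph convolution whose adjacency over lip landmarks is learned rather than fixed, can be sketched as follows. This is not the authors' code: the layer, node count, feature sizes, and the learnable-adjacency design are illustrative assumptions in the spirit of the described ASGM.

```python
# Hedged sketch of an adaptive spatial graph convolution over lip
# landmarks, assuming (batch, num_nodes, channels) input. All names
# and hyperparameters here are hypothetical, not from the paper.
import torch
import torch.nn as nn

class AdaptiveSpatialGraphConv(nn.Module):
    def __init__(self, in_ch, out_ch, num_nodes):
        super().__init__()
        # Learnable adjacency: the layer discovers lip spatial topology
        # during training instead of using a fixed hand-crafted graph.
        self.adj = nn.Parameter(torch.eye(num_nodes))
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):  # x: (batch, num_nodes, in_ch)
        a = torch.softmax(self.adj, dim=-1)    # row-normalized edge weights
        x = torch.einsum('ij,bjc->bic', a, x)  # aggregate neighbor features
        return torch.relu(self.proj(x))

# Example: 20 two-dimensional lip landmarks per frame, batch of 4.
layer = AdaptiveSpatialGraphConv(in_ch=2, out_ch=16, num_nodes=20)
out = layer(torch.randn(4, 20, 2))
print(out.shape)  # torch.Size([4, 20, 16])
```

Features from such a shape stream would then be fused with an appearance stream (e.g. CNN features of the mouth crop) to form the multi-view representation the abstract describes.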