Gabor-based Audiovisual Fusion for Mandarin Chinese Speech Recognition

Published: 01 Jan 2022 · Last Modified: 11 Nov 2024 · EUSIPCO 2022 · CC BY-SA 4.0
Abstract: Audiovisual Speech Recognition (AVSR) is a popular research topic, and incorporating visual features into speech recognition systems has been found to deliver good results. In recent years, end-to-end Convolutional Neural Network (CNN) based deep learning has been widely utilized. However, such models often require large amounts of data and can be time-consuming to train. Much speech research also focuses on English-language datasets. In this paper, we propose a lightweight, optimized, and automated speech recognition system using Gabor-based feature extraction, combined with our Audiovisual Mandarin Chinese (AVMC) corpus. The system combines Mel-frequency Cepstral Coefficients (MFCCs) with a CNN-Bidirectional Long Short-Term Memory (BiLSTM)-Attention (CLA) model for audio speech recognition, and low-dimensional Gabor visual features with the same CLA model for visual speech recognition. As we focus on Chinese-language recognition, we individually analyse initials, finals, and tones as components of pinyin speech production. The proposed low-dimensionality system achieves 88.6%, 87.5%, and 93.6% accuracy for tones, initials, and finals respectively at the character level, and 84.8% for pinyin at the word level.
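The abstract's core visual front-end is Gabor-based feature extraction. The paper's exact filter parameters and pooling scheme are not given here, so the following is only a minimal NumPy sketch of the general technique: build a small bank of oriented 2D Gabor kernels and pool each kernel's response over a mouth-region patch into a low-dimensional feature vector. All parameter values (`size`, `sigma`, `lambd`, `gamma`, the number of orientations) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lambd=6.0, gamma=0.5, psi=0.0):
    """Real-valued 2D Gabor kernel: a Gaussian envelope times an oriented
    cosine carrier. Parameter values are illustrative, not from the paper."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates by the orientation angle theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / lambd + psi)
    return envelope * carrier

def gabor_features(patch, n_orientations=8):
    """Pool the absolute filter response of each oriented kernel over one
    (odd-sized, square) image patch, giving an n_orientations-dim vector."""
    size = patch.shape[0]
    feats = []
    for k in range(n_orientations):
        kern = gabor_kernel(size=size, theta=k * np.pi / n_orientations)
        feats.append(abs(np.sum(patch * kern)))
    return np.array(feats)

# Toy usage: a vertical bar in a 15x15 patch responds most strongly to the
# theta = 0 kernel, whose carrier varies along the horizontal axis.
patch = np.zeros((15, 15))
patch[:, 7] = 1.0
features = gabor_features(patch, n_orientations=8)
```

In a full pipeline, such per-frame vectors would be stacked over time and fed to the sequence model; the low dimensionality (here, one value per orientation) is what keeps the visual front-end lightweight compared with end-to-end CNN features.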