Can Audio-Visual Speech Recognition Outperform Acoustically Enhanced Speech Recognition in Automotive Environment?

Published: 01 Jan 2011 · Last Modified: 06 Feb 2025 · INTERSPEECH 2011 · CC BY-SA 4.0
Abstract: The use of visual features in the form of lip movements to improve the performance of acoustic speech recognition has been shown to work well, particularly in noisy acoustic conditions. However, whether this technique can outperform speech recognition that incorporates well-known acoustic enhancement techniques, such as spectral subtraction or multi-channel beamforming, is not known. This is an important question to answer, especially in an automotive environment, for the design of an efficient human-vehicle computer interface. We perform a variety of speech recognition experiments on a challenging automotive speech dataset, and the results show that synchronous HMM-based audio-visual fusion can outperform traditional single-channel as well as multi-channel acoustic speech enhancement techniques. We also show that a further improvement in recognition performance can be obtained by fusing speech-enhanced audio with the visual modality, demonstrating the complementary nature of the two robust speech recognition approaches.
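One of the acoustic enhancement baselines the abstract names is spectral subtraction, which estimates the noise magnitude spectrum from noise-only frames and subtracts it from each noisy frame before resynthesizing the waveform with the noisy phase. The sketch below is a minimal magnitude-domain illustration in plain NumPy; the `spectral_subtraction` helper and the `frame_len`, `hop`, and `floor` parameters are illustrative assumptions, not the paper's implementation or settings.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, frame_len=512, hop=256, floor=0.01):
    """Illustrative magnitude spectral subtraction (not the paper's setup).

    noisy:     1-D array of noisy speech samples
    noise_est: 1-D array of noise-only samples for the noise spectrum estimate
    """
    window = np.hanning(frame_len)

    # Average magnitude spectrum over noise-only frames
    noise_frames = [noise_est[i:i + frame_len] * window
                    for i in range(0, len(noise_est) - frame_len, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len, hop):
        frame = noisy[i:i + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Subtract the noise estimate; spectral flooring avoids negative magnitudes
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        # Overlap-add resynthesis, reusing the noisy phase
        out[i:i + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return out
```

With a Hann window and 50% overlap this overlap-add resynthesis is adequate for a demonstration; a practical system would normalize the window gain and track the noise spectrum adaptively rather than from a fixed noise-only segment.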
