An Asynchronous DBN for Audio-Visual Speech Recognition

Kate Saenko, Karen Livescu

SLT 2006 (modified: 08 Nov 2022)

Abstract: We investigate an asynchronous two-stream model for audio-visual speech recognition based on a dynamic Bayesian network (DBN). The model allows the audio and visual streams to de-synchronize within the boundaries of each word. The probability of de-synchronization by a given number of states is learned during training. This type of asynchrony has previously been used for pronunciation modeling and for visual speech recognition (lipreading); however, this is its first application to audio-visual speech recognition. We evaluate the model on an audio-visual corpus of English digits (CUAVE) with different levels of added acoustic noise, and compare it to several baselines. The asynchronous model outperforms audio-only and synchronous audio-visual baselines. We also compare models with different degrees of allowed asynchrony and find that the lowest error rate on this task is achieved when the audio and visual streams are allowed to de-synchronize by up to two states.
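
To make the notion of bounded asynchrony concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each word is modeled by the same number of left-to-right states in both streams and measures asynchrony as the absolute difference between the audio and visual state indices. The function names, the hard per-frame alignments, and the relative-frequency estimate are all hypothetical simplifications; the abstract states only that the asynchrony probabilities are learned during training, and the sketch does not model re-synchronization at word boundaries.

```python
from collections import Counter
from itertools import product


def allowed_state_pairs(n_states, max_async):
    """List joint (audio, visual) state-index pairs within one word whose
    indices differ by at most max_async (hypothetical helper)."""
    return [
        (a, v)
        for a, v in product(range(n_states), repeat=2)
        if abs(a - v) <= max_async
    ]


def estimate_async_probs(alignments, max_async):
    """Relative-frequency estimate of P(|a - v| = k) for k = 0..max_async.

    `alignments` is an assumed list of per-frame (audio, visual) state-index
    pairs. Counting hard alignments stands in for the training procedure,
    which the abstract does not describe.
    """
    counts = Counter(abs(a - v) for a, v in alignments)
    if not counts:
        raise ValueError("alignments must not be empty")
    if max(counts) > max_async:
        raise ValueError("alignment exceeds the allowed asynchrony")
    total = sum(counts.values())
    return {k: counts[k] / total for k in range(max_async + 1)}


if __name__ == "__main__":
    # With 3 states per word, allowing more asynchrony enlarges the joint state space.
    for max_async in (0, 1, 2):
        print(max_async, len(allowed_state_pairs(3, max_async)))

    # Toy alignment: the audio stream runs one state ahead in half of the frames.
    toy = [(0, 0), (1, 0), (1, 1), (2, 1)]
    print(estimate_async_probs(toy, max_async=2))
```

Setting `max_async=0` recovers a fully synchronous model, while larger values admit more joint state pairs; the comparison in the abstract corresponds to varying this limit.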
