Abstract: While speech recognition has become highly robust in recent years, it remains challenging under very noisy or reverberant conditions. Augmenting speech recognition with lipreading from video input is therefore a promising way to make it more reliable. For this purpose, we consider slow feature analysis (SFA), an unsupervised machine learning method that finds the most slowly varying features in sequential input data. It can automatically extract temporally slow features from a video sequence, such as lip movements, while at the same time removing quickly changing components such as noise. In this work, we apply SFA as an initial feature extraction step for automatic lipreading. Performance is evaluated on small-vocabulary lipreading in both the speaker-dependent and the speaker-independent case, showing that the features are competitive with the often highly successful combination of a discrete cosine transform and linear discriminant analysis, while also offering good interpretability.
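
To illustrate the idea behind SFA described in the abstract, the following is a minimal sketch of the linear variant in NumPy. It is not the implementation used in this work: the function name `linear_sfa`, its parameters, and the rank threshold are assumptions made for illustration. The sketch whitens the input, approximates the time derivative by finite differences, and keeps the directions in which that derivative has the smallest variance.

```python
import numpy as np


def linear_sfa(x, n_features):
    """Illustrative linear SFA (hypothetical helper, not the paper's code).

    x: array of shape (T, D), one observation per time step.
    Returns (y, deltas): y has shape (T, n_features) with the slowest
    feature first; deltas are the corresponding derivative variances.
    """
    # Center the data so every input dimension has zero mean.
    x = x - x.mean(axis=0)

    # Whiten: rotate and rescale so the covariance becomes the identity.
    cov = x.T @ x / (x.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)
    keep = evals > 1e-10 * evals.max()  # drop near-singular directions
    whitener = evecs[:, keep] / np.sqrt(evals[keep])
    z = x @ whitener

    # Approximate the time derivative with first-order differences.
    dz = np.diff(z, axis=0)
    dcov = dz.T @ dz / (dz.shape[0] - 1)

    # eigh returns eigenvalues in ascending order, so the first
    # eigenvectors span the most slowly varying directions.
    d_evals, d_evecs = np.linalg.eigh(dcov)
    w = d_evecs[:, :n_features]
    return z @ w, d_evals[:n_features]
```

For video input, each row of `x` would presumably be a flattened (and possibly dimension-reduced) frame of the mouth region; that preprocessing is an assumption here and is not specified in the abstract.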