Abstract: End-to-end automatic lip-reading usually comprises an encoder-decoder model and an optional external language model. In this work, we introduce two regularization methods to the field of lip-reading: First, we apply the regularized dropout (R-Drop) method to transformer-based lip-reading to improve its training-inference consistency. Second, the relaxed attention technique is applied during training for better external language model integration. We are the first to show that these two complementary approaches yield particularly strong performance if combined in the right manner. In particular, by adding an additional R-Drop loss and smoothing the attention weights in cross multi-head attention during training only, we achieve a new state of the art with a word error rate of 22.2% on Lip Reading Sentences 2 (LRS2). On LRS3, we rank 2nd with 25.5% WER using only 1,759 h of training data, while the 1st rank uses about 90,000 h. Our code is available at https://github.com/ifnspaml/Lipreading-RDrop-RA
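The two training-time regularizers named above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: R-Drop penalizes the symmetric KL divergence between two forward passes of the same input under different dropout masks, and relaxed attention blends the cross-attention weights with a uniform distribution over the source positions (the smoothing weight `gamma` is a hypothetical name for the interpolation coefficient).

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rdrop_loss(logits1, logits2):
    """Symmetric KL divergence between the output distributions of two
    forward passes of the same input (the R-Drop consistency loss)."""
    p, q = softmax(logits1), softmax(logits2)
    kl_pq = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    kl_qp = np.sum(q * (np.log(q) - np.log(p)), axis=-1)
    return 0.5 * float(np.mean(kl_pq + kl_qp))

def relax_attention(attn, gamma=0.1):
    """Smooth cross-attention weights toward a uniform distribution over
    the source positions; applied during training only."""
    seq_len = attn.shape[-1]
    return (1.0 - gamma) * attn + gamma / seq_len
```

In training, the model would be run twice on each batch (with dropout active), and `rdrop_loss` on the two sets of output logits would be added, suitably weighted, to the usual cross-entropy loss; `relax_attention` would be applied to the softmax output inside the decoder's cross multi-head attention.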