LARGE SCALE DEEP NEURAL NETWORK ACOUSTIC MODELING WITH SEMI-SUPERVISED TRAINING DATA FOR YOUTUBE VIDEO TRANSCRIPTION
Abstract: YouTube is a highly visited video sharing website where over one billion people watch six billion hours of video every month. Improving accessibility to these videos for the hearing impaired and for search and indexing purposes is an excellent application of automatic speech recognition. However, YouTube videos are extremely challenging for automatic speech recognition systems. Standard adapted Gaussian Mixture Model (GMM) based acoustic models can have word error rates above 50%, making this one of the most difficult reported tasks. Since 2009, YouTube has provided automatic generation of closed captions for videos detected to have English speech; the service now supports ten different languages. This paper describes recent improvements to the original system, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural network acoustic models with large state inventories. Applying an “island of confidence” filtering heuristic to select useful training segments, and increasing the model size by using 44,526 context-dependent states with a low-rank approximation of the final layer weight matrix, improved performance by about 13% relative to previously reported sequence-trained DNN results for this task.
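To illustrate why a low-rank factorization of the final layer matters at this output size, the sketch below compares parameter counts and shows a factored softmax forward pass. Only the 44,526-state inventory comes from the abstract; the hidden dimension and rank are hypothetical values chosen for illustration, not figures from the paper.

```python
import numpy as np

# Hypothetical dimensions for illustration. Only num_states (44,526) is taken
# from the abstract; hidden_dim and rank are assumptions, not paper values.
hidden_dim = 2048        # assumed size of the last hidden layer
num_states = 44526       # context-dependent state inventory from the abstract
rank = 256               # assumed bottleneck rank for the low-rank factorization

# Full final layer: one hidden_dim x num_states weight matrix.
full_params = hidden_dim * num_states

# Low-rank approximation: factor W (hidden_dim x num_states) into
# A (hidden_dim x rank) and B (rank x num_states), so W ≈ A @ B.
low_rank_params = hidden_dim * rank + rank * num_states

print(f"full final layer:     {full_params:,} parameters")
print(f"low-rank final layer: {low_rank_params:,} parameters")
print(f"reduction factor:     {full_params / low_rank_params:.1f}x")

def low_rank_softmax(h, A, B):
    """Softmax over CD states using the factored final layer."""
    logits = h @ A @ B                                # (batch, num_states)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Usage example with random weights and a small batch of hidden activations.
rng = np.random.default_rng(0)
A = rng.standard_normal((hidden_dim, rank)) * 0.01
B = rng.standard_normal((rank, num_states)) * 0.01
h = rng.standard_normal((4, hidden_dim))
posteriors = low_rank_softmax(h, A, B)
print(posteriors.shape)                               # (4, 44526)
```

With these illustrative dimensions, the factorization cuts the final layer from roughly 91 million to about 12 million parameters, which is the practical motivation for pairing a very large state inventory with a low-rank output layer.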