MESRS: Models Ensemble Speech Recognition System

Published: 01 Jan 2020, Last Modified: 12 May 2023 · SAI (2) 2020
Abstract: Speech recognition (SR) technology recognizes spoken words and phonemes recorded in audio and video files. This paper presents a novel SR method based on our implementation of an ensemble of multiple deep learning (DL) models with different architectures. In contrast to standard SR systems, we ensemble the most commonly used DL architectures and combine their outputs with dynamic weighted averaging in order to classify audio clips correctly. Models are trained by converting audio signals from the audio space into the image space and using the converted images as training input. This way, most of the default parameters originally used for training image models can also be used for training ours. We show that the combination of space conversion and model ensembling achieves high accuracy. The paper has two main objectives: first, to extend the DL training process to the audio space; second, to present a new platform for ensembling deep learning models using weighted averages. Previous works in this field tend to stay within the comfort zone of a single DL architecture, fine-tuned to capture all edge cases. We show that applying a dynamic weighted average over multiple architectures can improve the final classification results significantly. Since a model that classifies high-pitch audio well may not classify low-pitch audio as well, and vice versa, we harness the capabilities of multiple architectures to handle the various edge cases.
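The abstract describes two mechanisms: converting audio clips into the image space and combining per-model predictions with a weighted average. The sketch below illustrates both ideas under stated assumptions; it is not the authors' code, and the library choices (librosa, NumPy), function names, and parameters are illustrative only. A log-mel spectrogram stands in for the audio-to-image conversion, and the ensemble weights are assumed to come from some per-model score such as validation accuracy.

```python
# Minimal sketch, not the paper's implementation: audio-to-image conversion
# plus a weighted-average ensemble over per-model class probabilities.
import numpy as np
import librosa


def audio_to_image(path, sr=16000, n_mels=128):
    """Convert an audio clip into a log-mel spectrogram 'image' (2-D array)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def weighted_ensemble(probabilities, weights):
    """Combine per-model class probabilities with a weighted average.

    probabilities: list of arrays, each of shape (n_classes,)
    weights: one weight per model (e.g., derived from validation accuracy)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize weights to sum to 1
    stacked = np.stack(probabilities)    # shape: (n_models, n_classes)
    combined = (w[:, None] * stacked).sum(axis=0)
    return int(np.argmax(combined))      # index of the predicted class


# Usage example with made-up model outputs for a 3-class problem:
# two models disagree, and the higher-weighted one dominates the vote.
preds = [np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.6, 0.1])]
print(weighted_ensemble(preds, weights=[0.8, 0.2]))  # -> 0
```

How the weights are set dynamically per clip is the paper's contribution; the fixed weights above are only a placeholder for that mechanism.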