Abstract: In this paper, we present a method for enhancing an automatic speech recognition (ASR) dataset using an immature pre-trained model and a script. By comparing the chunks obtained from the pre-trained model with the ground-truth script, we produce pairs consisting of an audio segment and its script. In each pair, the audio covers the exact beginning and end of an utterance, and the script is clean because it is taken from the human-written script. Experiments on news videos and their scripts show that our method extracts ASR datasets accurately and effectively. In addition, the resulting dataset can be used to train speech synthesis models.
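
The abstract does not state how the chunks are compared with the script, so the following is only a minimal sketch of one possible comparison, not the method used in the paper. It assumes each chunk is a tuple of start time, end time, and the pre-trained model's hypothesis text, and it uses word-level matching from Python's standard `difflib` module. The function name `align_chunks_to_script` and the chunk layout are hypothetical.

```python
from difflib import SequenceMatcher


def align_chunks_to_script(chunks, script):
    """Pair each audio chunk with the span of the human-written script it covers.

    chunks: list of (start_sec, end_sec, hypothesis_text) tuples (assumed layout).
    script: the ground-truth script as a single string.
    Returns a list of (start_sec, end_sec, script_text) tuples.
    """
    script_words = script.split()
    # Normalize script words for matching only; the output keeps the original text.
    norm_script = [w.lower().strip(".,!?") for w in script_words]

    # Flatten all hypothesis words and remember which chunk each word came from.
    hyp_words, owner = [], []
    for idx, (_, _, text) in enumerate(chunks):
        for w in text.lower().split():
            hyp_words.append(w)
            owner.append(idx)

    matcher = SequenceMatcher(a=hyp_words, b=norm_script, autojunk=False)

    # For each chunk, track the first and last script word matched to it.
    spans = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            c = owner[block.a + k]
            j = block.b + k
            lo, hi = spans.get(c, (j, j))
            spans[c] = (min(lo, j), max(hi, j))

    # Chunks with no matched word are dropped rather than guessed.
    pairs = []
    for idx, (start, end, _) in enumerate(chunks):
        if idx in spans:
            lo, hi = spans[idx]
            pairs.append((start, end, " ".join(script_words[lo:hi + 1])))
    return pairs


# Toy example: the hypothesis contains a recognition error ("whether").
chunks = [(0.0, 1.5, "the whether is nice today"), (2.0, 3.2, "we go to the park")]
script = "The weather is nice today. We go to the park."
for pair in align_chunks_to_script(chunks, script):
    print(pair)
```

In this sketch the misrecognized word inside a chunk is replaced by the script's wording because the span runs from the first to the last matched word. An error at a chunk boundary would fall outside the span, so a practical system would need a more careful boundary rule.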