Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis

Published: 16 Jun 2023, Last Modified: 28 Jun 2023 | SSW12 | Readers: Everyone
Keywords: data augmentation, silent speech interfaces, ultrasound tongue imaging, articulation-to-speech synthesis
TL;DR: Articulation-to-Speech Synthesis focuses on converting articulatory biosignal information into audible speech; within this task, we propose data augmentation of ultrasound tongue images.
Abstract: Articulation-to-Speech Synthesis (ATS) focuses on converting articulatory biosignal information into audible speech, nowadays mostly using deep neural networks (DNNs), with a Silent Speech Interface as a future target application. Ultrasound Tongue Imaging (UTI) is an affordable and non-invasive technique that has become popular for collecting articulatory data. Data augmentation has been shown to improve the generalization ability of DNNs, e.g. by avoiding overfitting, introducing variation into the existing dataset, or making the network more robust against various types of noise in the input data. In this paper, we compare six different data augmentation methods on the UltraSuite-TaL corpus during UTI-based ATS using convolutional neural networks (CNNs). Validation mean squared error is used to evaluate the performance of the CNNs, while the performance of direct ATS is measured on the synthesized speech samples using mel-cepstral distortion (MCD) and PESQ scores. Although we did not find large differences between the outcomes of the various data augmentation techniques, the results of this study suggest that, while applying data augmentation to UTI poses some challenges due to the unique nature of the data, it provides benefits in terms of enhancing the robustness of neural networks. In general, articulatory control might be beneficial in TTS as well.
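For readers unfamiliar with the two numeric measures named in the abstract, the following is a minimal sketch of how validation mean squared error and mel-cepstral distortion are commonly computed. It is not taken from the paper: the function names are hypothetical, and the sketch assumes the reference and synthesized mel-cepstral frames are already time-aligned and that the energy coefficient has been dropped. The abstract does not state the paper's exact settings.

```python
import numpy as np


def validation_mse(pred, target):
    """Mean squared error between predicted and target feature arrays."""
    return float(np.mean((pred - target) ** 2))


def mel_cepstral_distortion(ref, syn):
    """Average MCD in dB over frames.

    ref, syn: arrays of shape (frames, coeffs), assumed time-aligned
    and assumed to exclude the energy coefficient (illustrative choice).
    """
    diff = ref - syn
    # Per-frame distance: sqrt(2 * sum of squared coefficient differences)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Scale by 10 / ln(10) to express the result in dB, then average
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))
```

Lower values are better for both measures. PESQ is not sketched here because it is a standardized perceptual metric that is normally computed with an existing implementation rather than re-derived.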