Abstract: Ultrasound tongue imaging is widely used in clinical linguistics and phonetics. Recently, deep neural networks, especially convolutional neural networks, have been widely applied to the interpretation and analysis of ultrasound tongue images (UTI). Although these models achieve satisfactory performance, they rely on large amounts of manually labeled data, which are often difficult to obtain in practice. To address this issue, this paper investigates how to exploit large amounts of unlabeled UTI data to improve performance on the UTI classification task. Specifically, we explore self-supervised learning with a masked modeling strategy. During pre-training, the network learns to predict the masked part of its input, which encourages it to infer contextual information. We then fine-tune the pre-trained model with a small amount of labeled data. Compared with previous competing algorithms, our method improves classification accuracy by an average of 13.33% across four different scenarios.
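As an illustration only, the masking step of such a pre-training objective might be sketched as follows. The abstract does not specify the masking scheme, so the square non-overlapping patches, the patch size, the mask ratio, the zero fill value, and the mean-squared-error loss restricted to masked pixels are all assumptions, and the helper names `mask_patches` and `masked_mse` are hypothetical.

```python
import random


def mask_patches(image, patch_size, mask_ratio, seed=None):
    """Zero out a random subset of non-overlapping square patches.

    Assumptions (not taken from the paper): square patches, zero fill,
    and a fixed fraction `mask_ratio` of patches hidden per image.
    `image` is a 2D list of floats whose height and width are both
    divisible by `patch_size`.
    Returns (masked_copy, sorted_indices_of_masked_patches).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    cols = width // patch_size
    num_patches = (height // patch_size) * cols
    num_masked = int(round(mask_ratio * num_patches))
    rng = random.Random(seed)
    masked = sorted(rng.sample(range(num_patches), num_masked))

    out = [row[:] for row in image]  # copy so the input stays intact
    for idx in masked:
        top = (idx // cols) * patch_size
        left = (idx % cols) * patch_size
        for r in range(top, top + patch_size):
            for c in range(left, left + patch_size):
                out[r][c] = 0.0
    return out, masked


def masked_mse(prediction, target, masked, patch_size):
    """Mean squared error over the masked patches only (assumed loss)."""
    cols = len(target[0]) // patch_size
    total, count = 0.0, 0
    for idx in masked:
        top = (idx // cols) * patch_size
        left = (idx % cols) * patch_size
        for r in range(top, top + patch_size):
            for c in range(left, left + patch_size):
                total += (prediction[r][c] - target[r][c]) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 8x8 "frame" of ones, split into 2x2 patches with 75% hidden.
frame = [[1.0] * 8 for _ in range(8)]
corrupted, hidden = mask_patches(frame, patch_size=2, mask_ratio=0.75, seed=0)
# A network would receive `corrupted` and be trained to reduce
# masked_mse(network_output, frame, hidden, 2).
```

In this sketch the loss is evaluated only on the hidden patches, so the model cannot succeed by copying visible pixels; it must use the surrounding context. Fine-tuning would then reuse the pre-trained encoder with a classification head trained on the small labeled set.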