Unsupervised Mismatch Localization in Cross-Modal Sequential Data with Application to Mispronunciations Localization

Published: 15 Dec 2022, Last Modified: 28 Feb 2023
Accepted by TMLR
Abstract: Content mismatch usually occurs when data from one modality is translated to another, e.g., language learners produce mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content of the two modalities is perfectly matched, making it difficult to locate such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of the speech into hierarchically structured latent variables indicating the relationship between the two modalities. Training such a model is very challenging because of the discrete latent variables and their complex dependencies. To address this challenge, we propose a novel and effective training procedure that alternates between estimating the hard assignments of the discrete latent variables over a specifically designed mismatch localization finite-state acceptor (ML-FSA) and updating the parameters of the neural networks. In this work, we focus on the mismatch localization problem for speech and text, and our experimental results show that ML-VAE successfully locates the mismatch between text and speech without requiring human annotations for model training.
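The alternating procedure described in the abstract (hard assignments of discrete latent variables via decoding over an acceptor, followed by parameter updates) can be illustrated with a minimal toy sketch. This is not the ML-VAE implementation: it replaces the ML-FSA with a plain chain of states decoded by Viterbi, and replaces the neural-network update with a Gaussian-mean refit, purely to show the alternation pattern. All names (`viterbi_hard_assign`, `train_alternating`) are hypothetical.

```python
import numpy as np

def viterbi_hard_assign(log_emissions, log_trans):
    """Hard-assign one state per frame by Viterbi decoding.

    Stand-in for decoding over the ML-FSA; the real acceptor has
    match/mismatch arcs rather than a plain state chain.
    """
    T, S = log_emissions.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_emissions[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans  # (S, S) transition scores
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_emissions[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def train_alternating(frames, n_states, n_iters=10, seed=0):
    """Toy alternation: (E-like) hard-assign frames to states,
    (M-like) refit per-state Gaussian means given those assignments.

    The M-like step is a simplified stand-in for the gradient update
    of the neural-network parameters in the paper's procedure.
    """
    rng = np.random.default_rng(seed)
    means = rng.normal(size=(n_states, frames.shape[1]))
    log_trans = np.log(np.full((n_states, n_states), 1.0 / n_states))
    path = []
    for _ in range(n_iters):
        # Hard-assignment step: negative squared distance as log-emission.
        log_em = -0.5 * ((frames[:, None, :] - means[None]) ** 2).sum(-1)
        path = viterbi_hard_assign(log_em, log_trans)
        # Update step: move each state's mean to its assigned frames.
        for s in range(n_states):
            idx = [t for t, p in enumerate(path) if p == s]
            if idx:
                means[s] = frames[idx].mean(axis=0)
    return means, path
```

The key point the sketch preserves is that the discrete assignments are made *hard* (a single best path) rather than marginalized, which sidesteps differentiating through the discrete latent variables.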
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=Ktr9oWhFYv
Changes Since Last Submission:

### 10 Nov 2022

We appreciate the thorough comments and suggestions from all the reviewers. We have revised our paper based on your feedback, and the revised content is highlighted in ***red***. Here are the major changes that have been made:

1. The graphical model has been revised to present $b_t$ correctly. The description of the boundary detector has also been revised accordingly.
2. An additional paragraph has been added to the end of Section 4 "Model" to describe the model architecture.
3. In Section 5 "Mismatch Localization Finite-State Acceptor", details are added to explain how the weights are estimated.
4. The estimation process of $\hat {\mathcal B}$ and $\hat \Pi$ has been fully revised.
5. Section 7 "ML-VAE with REINFORCE Algorithm" has been revised to correct typos and add more discussion of the REINFORCE algorithm.
6. Alignment experimental results have been added to Appendix D "Experimental Results of the Alignment Task" to demonstrate how traditional alignment methods fail on inputs with mispronunciations.

Some other small changes have also been made to address the issues raised by the reviewers.

---

### 25 Nov 2022

We thank the reviewers again for the valuable further feedback and comments. We have further revised our paper based on their latest suggestions and questions. The second-round revised content is highlighted in ***brown*** (the first-round revisions were highlighted in red). Here are the major changes we have made:

1. Added Appendix A to explain how the posterior is derived.
2. Fixed some typos and addressed some minor issues mentioned by the reviewers.

---

### 14 Dec 2022

We addressed the question raised by the action editor in the future work section and uploaded the camera-ready version of our paper.
Assigned Action Editor: ~Brian_Kingsbury1
Submission Number: 477