Abstract: We propose a massively multilingual speech-to-text neural forced aligner that supports 98 languages with a single architecture. The aligner takes self-supervised discrete acoustic units and unnormalized characters, including punctuation marks, as inputs. We train the aligner as part of a non-autoregressive text-to-unit (T2U) model without any external aligner. The T2U model is trained on paired speech-text data from various domains and recording conditions. Experimental evaluation demonstrates that the proposed T2U aligner achieves alignment quality competitive with existing monolingual aligners while supporting many more languages. We also showcase zero-shot forced alignment on unseen languages.
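As a rough illustration of how a forced aligner built on a non-autoregressive T2U model can yield character-level timestamps, the sketch below converts per-character unit durations (as a duration-predicting T2U model might output them) into time spans. The frame rate, duration values, and function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): cumulate per-character discrete-unit
# durations into (char, start_s, end_s) spans. Frame rate is a hypothetical
# value for fixed-rate self-supervised acoustic units.

UNIT_FRAME_RATE_HZ = 50  # assumption: 20 ms per discrete acoustic unit

def durations_to_timestamps(chars, durations, frame_rate_hz=UNIT_FRAME_RATE_HZ):
    """Turn predicted unit counts per character into character time spans."""
    assert len(chars) == len(durations)
    spans, start = [], 0
    for ch, dur in zip(chars, durations):
        end = start + dur
        spans.append((ch, start / frame_rate_hz, end / frame_rate_hz))
        start = end
    return spans

# Hypothetical usage: unnormalized characters (punctuation kept) with
# made-up per-character unit durations.
print(durations_to_timestamps(list("Hi!"), [4, 6, 2]))
```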