Abstract: We propose a massively multilingual speech-to-text neural forced aligner that supports 98 languages with a single architecture. The aligner takes self-supervised discrete acoustic units and unnormalized characters, including punctuation marks, as inputs. We train the aligner as part of a non-autoregressive text-to-unit (T2U) model without any external aligner. The T2U model is trained on paired speech-text data from various domains and recording conditions. Experimental evaluation demonstrates that the proposed T2U aligner achieves alignment quality competitive with existing monolingual aligners while supporting many more languages. We also showcase zero-shot forced alignment on unseen languages.
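As a rough illustration of how a forced aligner built on a non-autoregressive T2U model can yield character-level timestamps, the sketch below converts per-character unit durations (as a duration-predicting T2U model might output them) into time spans. The frame rate, duration values, and function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): cumulate per-character discrete-unit
# durations into (char, start_s, end_s) spans. Frame rate is a hypothetical
# value for fixed-rate self-supervised acoustic units.

UNIT_FRAME_RATE_HZ = 50  # assumption: 20 ms per discrete acoustic unit

def durations_to_timestamps(chars, durations, frame_rate_hz=UNIT_FRAME_RATE_HZ):
    """Turn predicted unit counts per character into character time spans."""
    assert len(chars) == len(durations)
    spans, start = [], 0
    for ch, dur in zip(chars, durations):
        end = start + dur
        spans.append((ch, start / frame_rate_hz, end / frame_rate_hz))
        start = end
    return spans

# Hypothetical usage: unnormalized characters (punctuation kept) with
# made-up per-character unit durations.
print(durations_to_timestamps(list("Hi!"), [4, 6, 2]))
```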