Automated Tone Transcription and Clustering with Tone2Vec

ACL ARR 2024 June Submission66 Authors

05 Jun 2024 (modified: 02 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Lexical tones play a crucial role in Sino-Tibetan languages. However, current phonetic fieldwork relies on manual effort, resulting in substantial time and financial costs. This is especially challenging for the numerous endangered languages that are rapidly disappearing, a situation often exacerbated by limited funding. In this paper, we introduce pitch-based similarity representations for tone transcription, named Tone2Vec. Experiments on dialect clustering and variance show that Tone2Vec effectively captures fine-grained tone variation. Using Tone2Vec, we develop the first automatic approach for tone transcription and clustering by presenting a novel representation transformation for transcriptions. Additionally, these algorithms are systematically integrated into an open-source, easy-to-use package, ToneLab, which facilitates automated fieldwork and cross-regional, cross-lexical analysis for tonal languages. Extensive experiments demonstrate the effectiveness of our methods. Experiment implementations are available at https://anonymous.4open.science/r/Tone2vec-E5D4 (this IPYNB file contains all the experimental details presented in this paper; the official package will be released upon acceptance).
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: NLP tools for social analysis
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: Sino-Tibetan tonal languages; Jianghuai Mandarin; Standard Mandarin
Submission Number: 66