Abstract: Formosan languages, spoken by the indigenous peoples of Taiwan, play a unique role in reconstructing Proto-Austronesian. This paper presents a real-world Formosan language speech dataset comprising 144 hours of news footage in 16 Formosan languages. One merit of the dataset is that it allows the relationships among Formosan languages to be examined in vivo. With the help of deep learning models, the speech data can be analyzed without transcription. Specifically, we first train a language classifier based on XLSR-53 that classifies the 16 Formosan languages with an accuracy of 88%. Then, we extract the speech vector representations learned by the model and compare them with 153 manually coded linguistic typological features. The comparison suggests that the speech vectors reflect the phonological and morphological aspects of Formosan languages. In addition, these linguistic features are used to construct a linguistic phylogeny, and the resulting genealogical grouping corresponds with previous literature. To sum up, the dataset opens up possibilities for investigating the current real-world use of Formosan languages.
Paper Type: long
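One way to picture the comparison between speech vectors and typological features mentioned in the abstract is as a correlation between two language-by-language distance matrices. The sketch below is illustrative only: the abstract does not say which comparison method the authors use, and the data here are synthetic. The function names, the embedding size of 32, the binary coding of the 153 features, and the choice of cosine and Hamming distances are all assumptions made for the example.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Pairwise cosine distances between the rows of X (one row per language)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T


def hamming_distance_matrix(F):
    """Fraction of coded features on which each pair of languages differs."""
    return (F[:, None, :] != F[None, :, :]).mean(axis=2)


def upper_triangle_correlation(D1, D2):
    """Pearson correlation between the off-diagonal upper triangles of two matrices."""
    iu = np.triu_indices_from(D1, k=1)
    return float(np.corrcoef(D1[iu], D2[iu])[0, 1])


# Synthetic stand-ins (assumptions, not the paper's data):
# 16 languages, 153 binary typological features, 32-dimensional speech vectors.
rng = np.random.default_rng(0)
n_languages, n_features, dim = 16, 153, 32
features = rng.integers(0, 2, size=(n_languages, n_features))

# Fake "speech vectors": a random linear projection of the centered features
# plus a little noise, so the two views are related by construction.
projection = rng.normal(size=(n_features, dim))
embeddings = (features - features.mean(axis=0)) @ projection
embeddings = embeddings + 0.1 * rng.normal(size=(n_languages, dim))

D_speech = cosine_distance_matrix(embeddings)
D_typology = hamming_distance_matrix(features)
r = upper_triangle_correlation(D_speech, D_typology)
print(f"correlation between distance matrices: {r:.3f}")
```

In a real analysis the rows of `embeddings` would be per-language averages of vectors taken from the trained classifier, and a permutation test over language labels would be a natural way to judge whether the correlation is larger than chance.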