A Study of Speech Embedding Similarities Between Australian Aboriginal and High-Resource Languages

Published: 2025, Last Modified: 27 Jan 2026. Venue: INTERSPEECH 2025. License: CC BY-SA 4.0
Abstract: Low-resource languages, such as Australian Aboriginal languages, are underrepresented in the AI landscape because little digital data is available for them, which in turn hinders the development of speech processing models. Leveraging sufficiently similar high-resource languages may help bridge this gap. This study examines the similarities between speech embeddings of Aboriginal languages and 107 high-resource languages, including English, Spanish, and Mandarin, using Wav2Vec2 and VoxLingua107-ECAPA-TDNN. Through three language identification tasks, we analyze Warlpiri, Dalabon, and Light Warlpiri alongside 107 other languages. Our results reveal that Aboriginal languages are most frequently identified as Māori, suggesting phonetic or structural similarities, while showing significant differences from globally dominant languages. We also observe that Warlpiri and Dalabon match Hindi and Malayalam more closely than they match other languages.
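As a rough illustration of what "most frequently identified as" could mean in a language identification setting, the sketch below tallies per-utterance predicted labels and reports each label's share. It is a minimal sketch, not the authors' pipeline: the function name `identification_shares`, the label codes, and the toy predictions are all hypothetical, and the step that produces the predictions (running a language identification model over audio) is assumed to have happened elsewhere.

```python
from collections import Counter


def identification_shares(predicted_labels):
    """Return (label, share) pairs, most frequently predicted label first.

    `predicted_labels` is assumed to hold one predicted language label
    per utterance, as produced by some language identification model.
    """
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


# Hypothetical predictions for five utterances (made-up values,
# not data from the study).
toy_predictions = ["mi", "mi", "hi", "mi", "ml"]
shares = identification_shares(toy_predictions)

for label, share in shares:
    print(f"{label}: {share:.0%}")
```

Reporting shares rather than raw counts keeps the comparison meaningful when the languages under study have different numbers of utterances.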