Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting

Anonymous

16 Jan 2022 (modified: 05 May 2023)
ACL ARR 2022 January Blind Submission
Readers: Everyone
Abstract: Te reo Māori, New Zealand's only indigenous language, is code-switched with English. Most Māori speakers are bilingual, and the use of Māori is increasing in New Zealand English. Unfortunately, due to the minimal availability of resources, including digital data, Māori is under-represented in technological advances. Cloud-based systems such as Google and Azure support Māori language detection. However, we provide experimental evidence that the accuracy of such systems is low when detecting Māori. Hence, with the support of the Māori community, we collect Māori and bilingual data and use natural language processing (NLP) to improve Māori language detection. We train bilingual sub-word embeddings and show that they improve overall accuracy compared to publicly available monolingual embeddings. This improvement has been verified for various NLP tasks using three bilingual databases containing formal transcripts and informal social media data. We also show that a BiLSTM with bilingual sub-word embeddings outperforms large-scale contextual language models such as BERT on downstream tasks of detecting the Māori language. The best accuracy, 87%, was obtained using a BiLSTM with bilingual embeddings for detecting code-switch points in bilingual sentences.
Paper Type: long