Pre-trained language model for code-mixed text in Indonesian, Javanese, and English using transformer
Abstract: Pre-trained language models (PLMs) have become increasingly popular due to their ability to achieve state-of-the-art performance on various natural language processing tasks with less training data and time. However, they struggle when dealing with code-mixed data, which is characterized by colloquial language and inconsistent linguistic forms. This limitation arises because most available PLMs are trained on monolingual data for individual languages. Furthermore, the availability of PLMs specifically designed for code-mixed text remains limited, especially for low-resource languages like Indonesian (ID) and Javanese (JV). Despite their large numbers of speakers, both languages remain underrepresented in natural language processing. To address these issues, this study introduces IndoJavE, a series of pre-trained language models specifically designed for code-mixed Indonesian, Javanese, and English (EN) text. We developed four transformer-based models, IndoJavE-BERT, IndoJavE-RoBERTa, IndoJavE-IndoBERTweet, and IndoJavE-IndoBERT, using two approaches: training from scratch and transfer learning. Our results show that transfer learning is more efficient and effective than training from scratch. The IndoJavE models outperformed multilingual and monolingual models in various downstream NLP tasks, highlighting the importance of specialized pre-trained models for handling code-mixed text. This study paves the way for future research directions, including exploring the models’ performance in diverse NLP tasks and developing more versatile pre-trained language models that can adapt to a broader range of code-mixed languages and dialects.
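The transfer-learning approach described in the abstract amounts to continued pre-training of an existing checkpoint on code-mixed text with a masked language modeling objective. The sketch below illustrates what such continued pre-training could look like with the Hugging Face Transformers library; the base checkpoint name, corpus file, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch: continued pre-training ("transfer learning") of an existing
# BERT-family checkpoint on code-mixed ID/JV/EN text via masked language
# modeling. Checkpoint name, corpus path, and hyperparameters are assumed
# for illustration only.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_checkpoint = "indolem/indobertweet-base-uncased"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Assumed plain-text corpus of code-mixed Indonesian/Javanese/English
# social-media posts, one document per line.
corpus = load_dataset("text", data_files={"train": "code_mixed_id_jv_en.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for the MLM objective (15% of tokens masked per batch).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="indojave-continued-pretraining",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

Training from scratch would follow the same loop but start from a randomly initialized model (and typically a tokenizer trained on the code-mixed corpus), which is why it requires substantially more data and compute than continued pre-training.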
External IDs: dblp:journals/snam/HidayatullahALQ25