Abstract:
Code-switching, the practice of multilingual speakers alternating between two or more languages within a conversation or sentence, is commonly observed in multilingual communities, yet it poses unique challenges for natural language processing (NLP) systems. While existing NLP models can detect and process code-switched text, they do not predict switch points at the token level. In this paper, we introduce a token-level prediction framework for identifying upcoming switch points in Chinese-English conversations. We present two approaches: a window-based model leveraging BERT embeddings and recurrent architectures, and a transformer-based model using mBERT and XLM-RoBERTa. Trained and evaluated on the ASCEND dataset, our best RNN-based model achieves an AUC of 0.91 for Chinese-to-English prediction, while our transformer-based model (mBERT) achieves an AUC of 0.98. These results demonstrate the promise of token-level code-switch prediction and its potential to support mixed-language NLP applications such as conversational AI and machine translation.
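The paper's code is not reproduced here, but the following minimal sketch illustrates the kind of transformer-based setup the abstract describes: mBERT used as a token-level classifier, where each token is labeled by whether a language switch follows it. The binary label scheme, checkpoint name, and example utterance are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (assumed setup, not the authors' released code):
# token-level switch-point prediction with mBERT. Label 1 means
# "a code-switch follows this token", label 0 means "no switch".
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=2,  # 0 = no switch, 1 = switch after this token
)
model.eval()

# Toy Chinese-English code-switched utterance; the Chinese-to-English
# switch happens right before "Netflix".
text = "我 昨天 看 Netflix 的 新 节目"
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits          # shape: (1, seq_len, 2)
probs = logits.softmax(-1)[0, :, 1]       # P(switch after each token)

for tok, p in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), probs):
    print(f"{tok:>12s}  P(switch) = {p.item():.3f}")
```

In practice the classification head would be fine-tuned on switch-point labels derived from a code-switched corpus such as ASCEND; the snippet above runs with randomly initialized head weights and is meant only to show the input/output shape of the task.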
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Machine Learning for NLP
Contribution Types: NLP engineering experiment
Languages Studied: Chinese, English
Submission Number: 2575