Abstract:
Code-switching, the practice of multilingual speakers alternating between two or more languages within a conversation or sentence, is commonly observed in multilingual communities, yet it poses unique challenges for natural language processing (NLP) systems. While existing NLP models can detect and process code-switched text, they do not predict switch points at the token level. In this paper, we introduce a token-level prediction framework for identifying upcoming switch points in Chinese-English conversations. We present two approaches: a window-based model leveraging BERT embeddings and recurrent architectures, and a transformer-based model using mBERT and XLM-RoBERTa. Trained and evaluated on the ASCEND dataset, our best RNN-based model achieves an AUC of 0.91 for Chinese-to-English prediction, while our transformer-based model (mBERT) achieves an AUC of 0.98. These results demonstrate the promise of token-level code-switch prediction and its potential to support mixed-language NLP applications such as conversational AI and machine translation.
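The paper's code is not reproduced here, but the following minimal sketch illustrates the kind of transformer-based setup the abstract describes: mBERT used as a token-level classifier, where each token is labeled by whether a language switch follows it. The binary label scheme, checkpoint name, and example utterance are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (assumed setup, not the authors' released code):
# token-level switch-point prediction with mBERT. Label 1 means
# "a code-switch follows this token", label 0 means "no switch".
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=2,  # 0 = no switch, 1 = switch after this token
)
model.eval()

# Toy Chinese-English code-switched utterance; the Chinese-to-English
# switch happens right before "Netflix".
text = "我 昨天 看 Netflix 的 新 节目"
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits          # shape: (1, seq_len, 2)
probs = logits.softmax(-1)[0, :, 1]       # P(switch after each token)

for tok, p in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), probs):
    print(f"{tok:>12s}  P(switch) = {p.item():.3f}")
```

In practice the classification head would be fine-tuned on switch-point labels derived from a code-switched corpus such as ASCEND; the snippet above runs with randomly initialized head weights and is meant only to show the input/output shape of the task.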
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Machine Learning for NLP
Contribution Types: NLP engineering experiment
Languages Studied: Chinese, English
Submission Number: 2575