Beyond Detection: Predicting Code-Switch Points in Multilingual Conversations

Published: 22 Sept 2025, Last Modified: 27 Nov 2025 · WiML @ NeurIPS 2025 · CC BY 4.0
Keywords: Machine Learning, Natural Language Processing, Multilingual, Code-Switching
Abstract: Code-switching, the practice of multilingual speakers alternating between two or more languages, is common in bilingual communities. It poses significant challenges for natural language processing (NLP) systems, which are typically designed for monolingual input. While existing NLP models can identify languages and detect switches after they occur, they do not predict upcoming switch points in real time. This capability is crucial for applications such as voice assistants, chatbots, and predictive keyboards, where anticipating a language switch can prevent disruptions to downstream tasks. To fill this gap, we introduce a novel token-level framework for predicting upcoming switch points in Chinese-English conversations. Our study investigates two modeling paradigms: (1) a window-based model using BERT embeddings and recurrent architectures, and (2) a transformer-based approach using multilingual pre-trained models (mBERT and XLM-RoBERTa). We conduct our experiments on the ASCEND dataset, a high-quality corpus of spontaneous, multi-turn Chinese-English conversational dialogue comprising over 10 hours of speech. Among the window-based models trained with both fixed and flexible context windows, our best RNN model achieves an AUC of 0.91 for Chinese-to-English prediction, while the transformer-based mBERT model achieves an AUC of 0.98. These results demonstrate that modern NLP systems can be trained to anticipate language switches, not just detect and identify them, with substantial potential for mixed-language applications such as conversational AI and machine translation.
As the first study to investigate token-level code-switching prediction for Chinese-English using deep learning, our work also opens several promising avenues for future research. These include extending our approach to handle more than two languages and enhancing the predictive capability to determine the target language of the upcoming code-switch.
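The abstract's exact preprocessing is not specified, but the token-level formulation can be illustrated with a minimal sketch: given a language tag per token, label each position by whether the next token switches language (the prediction target), and extract fixed-size context windows for a window-based classifier. The tag names ("zh", "en") and the window size are illustrative assumptions, not details from the paper.

```python
def switch_labels(lang_tags):
    """Label each token 1 if the *next* token is in a different language
    (an upcoming switch point), else 0; the final token gets 0.
    Tag names ("zh", "en") are illustrative, not from the paper."""
    return [int(a != b) for a, b in zip(lang_tags, lang_tags[1:])] + [0]

def context_windows(tokens, size):
    """Fixed-size context windows ending at each position, as a
    window-based model might consume; positions with insufficient
    history are skipped for simplicity."""
    return [tokens[i - size + 1 : i + 1] for i in range(size - 1, len(tokens))]

# Toy utterance: two Chinese tokens, an English insertion, back to Chinese.
tags = ["zh", "zh", "en", "en", "zh"]
print(switch_labels(tags))       # [0, 1, 0, 1, 0]
print(context_windows(tags, 3))  # three windows of 3 consecutive tags
```

In practice the window contents would be contextual embeddings (e.g. BERT vectors) rather than raw tags, and the labels above would supervise either the recurrent or the transformer-based classifier.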
Submission Number: 133