Beyond Detection: Predicting Code-Switch Points in Multilingual Conversations
Keywords: Machine Learning, Natural Language Processing, Multilingual, Code-Switching
Abstract: Code-switching, the practice of multilingual speakers alternating between two or more languages, is a
phenomenon commonly observed in bilingual communities. It introduces significant challenges for natural
language processing (NLP) systems, which are typically designed for monolingual inputs. While existing NLP
models can identify languages and detect switches after they occur, they do not predict upcoming switch points
in real time. This capability is crucial for real-time applications such as voice assistants, chatbots, and
predictive keyboards, where anticipating a language switch can prevent disruptions to downstream tasks. This
work addresses that gap.
We introduce a novel token-level framework for predicting upcoming switch points in Chinese-English
conversations. Our study investigates two modeling paradigms: (1) a window-based
model using BERT embeddings and recurrent architectures, and (2) a transformer-based approach using
multilingual pre-trained models (mBERT and XLM-RoBERTa). We conducted our experiments on the ASCEND
dataset, a high-quality corpus of spontaneous, multi-turn Chinese-English conversational dialogue, comprising
over 10 hours of speech. The results show that, among the window-based models trained with both fixed and
flexible context windows, our best RNN model achieves an AUC of 0.91 for Chinese-to-English prediction. In
comparison, the transformer-based mBERT model achieves an AUC of 0.98. These results demonstrate the
viability of code-switch point prediction and its potential to support mixed-language NLP applications such as
conversational AI and machine translation.
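To make the task framing concrete, the sketch below shows one plausible way to derive token-level labels for upcoming switch points: each token is labeled 1 if the next token is in a different language, so a model conditions only on left context when predicting. The (token, language-tag) representation, the zh/en tags, and the function name label_switch_points are illustrative assumptions, not necessarily the paper's exact preprocessing.

# A minimal sketch of token-level switch-point labeling (an illustrative
# assumption; not necessarily the authors' exact preprocessing).
from typing import List, Tuple

def label_switch_points(tokens: List[Tuple[str, str]]) -> List[int]:
    """Label each token 1 if the NEXT token is in a different language.

    `tokens` holds (surface_form, lang_tag) pairs, with lang_tag in
    {"zh", "en"}. The final token is labeled 0, since no token follows it.
    """
    labels = []
    for i in range(len(tokens) - 1):
        _, cur_lang = tokens[i]
        _, next_lang = tokens[i + 1]
        labels.append(1 if cur_lang != next_lang else 0)
    labels.append(0)  # last token: no upcoming switch to predict
    return labels

if __name__ == "__main__":
    # Toy code-switched utterance: a Chinese-to-English switch before
    # "meeting" and an English-to-Chinese switch after it.
    utterance = [("我", "zh"), ("明天", "zh"), ("去", "zh"),
                 ("meeting", "en"), ("的", "zh"), ("时候", "zh")]
    print(label_switch_points(utterance))  # -> [0, 0, 1, 1, 0, 0]

Under this framing, and consistent with the reported AUC metric, both paradigms can be read as binary classifiers over such labels: the window-based models score a fixed or flexible span of preceding tokens, while the fine-tuned transformers score each token in context.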
These findings demonstrate that modern NLP systems can be trained to anticipate language switches, not
just detect and identify them, which holds substantial potential for advancing multilingual applications. As the
first study to investigate token-level code-switching prediction for Chinese-English using deep learning, our work
also opens several promising avenues for future research. These include extending our approach to more than
two languages and predicting the target language of an upcoming switch.
Submission Number: 133