Injecting Explicit Cross-lingual Embeddings into Pre-trained Multilingual Models for Code-Switching Detection

ACL ARR 2025 February Submission212 Authors

04 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Code-switching has become the modus operandi of internet communication as communities such as South Africa's have become domestically multilingual. This phenomenon makes textual data increasingly complex to process owing to non-standard spellings, spontaneous word replacements, and similar informal writing practices. Pre-trained multilingual models, however, have shown strong text-processing capabilities on related downstream tasks such as language identification, dialect detection, and discrimination between closely related languages. In this study, we extensively investigate the use of two pre-trained multilingual models, AfroXLMR and Serengeti, for code-switching detection on five South African languages: Sesotho, Setswana, IsiZulu, IsiXhosa, and English, where English is used interchangeably with the other four, under various transfer learning settings. Additionally, we explore modelling the known switching pairs within a dataset through explicit cross-lingual embeddings extracted with the projection methods VecMap, MUSE, and Canonical Correlation Analysis (CCA). The resulting cross-lingual embeddings replace the embedding layer of a pre-trained multilingual model without additional training. Concretely, our results show that performance gains can be realized by closing the representational gap between the languages of a code-switched dataset with known codes using cross-lingual representations. Moreover, augmenting code-switched datasets with data from closely related languages improves code-switching classification, especially when training examples are scarce.
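The CCA-based projection mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper `cca_project` and its interface are assumptions. Given row-aligned seed translation pairs from two monolingual embedding spaces, it learns linear maps into a shared space; in the paper's setting, the projected vectors would then overwrite rows of the pre-trained model's embedding matrix without further training.

```python
import numpy as np

def cca_project(X, Y, k):
    """Learn CCA projections for two embedding spaces.

    X: (n, d1) source-language vectors for n seed word pairs.
    Y: (n, d2) target-language vectors, row-aligned with X.
    Returns (Wx, Wy): maps into a shared k-dimensional space.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    reg = 1e-6  # small ridge term for numerical stability

    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):
        # inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # singular vectors of the whitened cross-covariance give the
    # canonical directions; singular values are the correlations
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :k]      # projection for the source space
    Wy = Ky @ Vt.T[:, :k]   # projection for the target space
    return Wx, Wy

# Toy check: if Y is a rotation of X, CCA recovers a shared space
# where corresponding dimensions correlate almost perfectly.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
R = np.linalg.qr(rng.normal(size=(8, 8)))[0]  # random rotation
Y = X @ R
Wx, Wy = cca_project(X, Y, 4)
Zx = (X - X.mean(axis=0)) @ Wx
Zy = (Y - Y.mean(axis=0)) @ Wy
```

In the full pipeline, `Zx`/`Zy`-style projected vectors for every vocabulary item (not just the seed pairs) would be written into the multilingual model's input embedding matrix, standing in for its original embedding layer.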
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Monolingual embeddings, Cross-lingual embeddings, Large pre-trained multilingual models, Code-switching detection
Contribution Types: Approaches to low-resource settings
Languages Studied: English, Sesotho, Setswana, IsiXhosa, IsiZulu
Submission Number: 212