Code-Switching Metrics Using Intonation Units

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main
Submission Type: Regular Long Paper
Submission Track: Multilinguality and Linguistic Diversity
Keywords: Computationally-aided linguistic analysis, Linguistic Diversity, Multilingualism and Cross-Lingual NLP, Spanish-English Code-Switching
TL;DR: While NLP models of code-switching (CS) have used word-level units, we advocate for a multi-word prosodic unit by showing how metrics of CS complexity are impacted by the distinction between other-language single-word items and multi-word strings.
Abstract: Code-switching (CS) metrics in NLP that are based on word-level units are misaligned with true bilingual CS behavior. Crucially, CS is not equally likely between any two words, but follows syntactic and prosodic rules. We adapt two metrics, multilinguality and CS probability, and apply them to transcribed bilingual speech, for the first time putting forward Intonation Units (IUs) – prosodic speech segments – as basic tokens for NLP tasks. In addition, we calculate these two metrics separately for distinct mixing types: alternating-language multi-word strings and single-word incorporations from one language into another. Results indicate that individual differences according to the two CS metrics are independent. However, there is a shared tendency among bilinguals for multi-word CS to occur across, rather than within, IU boundaries. That is, bilinguals tend to prosodically separate their two languages. This constraint is blurred when metric calculations do not distinguish multi-word and single-word items. These results call for a reconsideration of units of analysis in future development of CS datasets for NLP tasks.
Submission Number: 3987
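
Illustrative sketch (not from the paper): the abstract names two metrics, multilinguality and CS probability, computed over Intonation Units rather than words. The Python below shows one common way such metrics can be defined over a sequence of per-IU language labels. The abstract does not give the authors' exact adaptation, so everything here is an assumption: the M-index-style formula, the switch-rate formula, the function names `multilinguality` and `switch_probability`, the idea that each IU carries a single language label, and the toy label sequence.

```python
from collections import Counter


def multilinguality(labels):
    """M-index-style balance score over unit labels (assumed formulation).

    Returns 0.0 for a monolingual sequence and 1.0 when all observed
    languages are equally represented.
    """
    counts = Counter(labels)
    k = len(counts)  # number of languages observed in this sequence
    if k < 2:
        return 0.0
    total = sum(counts.values())
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


def switch_probability(labels):
    """Share of adjacent unit pairs whose language labels differ."""
    if len(labels) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(labels, labels[1:]))
    return switches / (len(labels) - 1)


# Hypothetical speaker: one language label per Intonation Unit.
iu_labels = ["eng", "eng", "spa", "spa", "eng", "spa"]
print(multilinguality(iu_labels))
print(switch_probability(iu_labels))
```

Because both functions take only a label sequence, the same code can be run on word-level labels or on IU-level labels to compare units of analysis. This sketch does not separate single-word incorporations from multi-word strings, which the abstract treats as distinct mixing types; doing so would require an additional annotation layer that is not described here.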