TongueSwitcher: Fine-Grained Identification of German-English Code-Switching

Published: 11 Oct 2023, Last Modified: 20 Oct 2023. Venue: EMNLP 2023 Workshop CALCS
Keywords: code-switching, code-mixing, corpus, dataset, german, english, mixed, ambiguous, interlingual, homograph, wordlist, multilingual, stemming
TL;DR: We present the largest corpus of naturally occurring German-English code-switching.
Abstract: This paper contributes to German-English code-switching research. We provide the largest corpus of naturally occurring German-English code-switching, where English is included in German text, and two methods for code-switching identification. The first method is rule-based, using wordlists and morphological processing. We use this method to compile a corpus of 25.6M tweets employing German-English code-switching. In our second method, we continue pretraining of a neural language model on this corpus and classify tokens based on embeddings from this language model. Our systems establish SoTA on our new corpus and an existing German-English code-switching benchmark. In particular, we systematically study code-switching for language-ambiguous words which can only be resolved in context, and morphologically mixed words consisting of both English and German morphemes. We distribute both corpora and systems to the research community.
Submission Type: Regular Long Paper (8 pages)
Submission Number: 2
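
The rule-based method mentioned in the abstract (wordlists plus morphological processing) can be illustrated with a minimal sketch. This is not the authors' implementation: the wordlists, the affix lists, the label names, and the functions `strip_german_affixes`, `tag_token`, and `tag_sentence` are all hypothetical and chosen only to show the general idea of tagging tokens as German, English, language-ambiguous, or morphologically mixed.

```python
import re

# Toy wordlists invented for illustration; NOT the TongueSwitcher wordlists.
GERMAN = {"ich", "habe", "das", "heute", "und", "war", "sehr", "die", "hand"}
ENGLISH = {"meeting", "download", "nice", "the", "hand", "update"}

# Hypothetical, heavily simplified German inflectional affixes.
GERMAN_PREFIXES = ("ge",)
GERMAN_SUFFIXES = ("et", "en", "st", "t", "e")


def strip_german_affixes(word):
    """Yield candidate stems after removing an optional German prefix
    and one German suffix from `word`."""
    for prefix in ("",) + GERMAN_PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in GERMAN_SUFFIXES:
            if rest.endswith(suffix) and len(rest) > len(suffix):
                yield rest[:-len(suffix)]


def tag_token(token):
    """Assign a coarse language label to a single token using wordlist lookup."""
    word = token.lower()
    in_de, in_en = word in GERMAN, word in ENGLISH
    if in_de and in_en:
        # Interlingual homograph: a wordlist alone cannot decide; context is needed.
        return "AMBIGUOUS"
    if in_de:
        return "DE"
    if in_en:
        return "EN"
    # English stem carrying German inflection, e.g. "gedownloadet".
    if any(stem in ENGLISH for stem in strip_german_affixes(word)):
        return "MIXED"
    return "UNKNOWN"


def tag_sentence(text):
    """Tokenize on word characters and tag each token."""
    return [(tok, tag_token(tok)) for tok in re.findall(r"\w+", text)]


if __name__ == "__main__":
    for token, label in tag_sentence("Ich habe das Update gedownloadet"):
        print(f"{token}\t{label}")
```

In this sketch, tokens labelled `AMBIGUOUS` are exactly the cases the abstract says can only be resolved in context, which is where a contextual model, such as the second method described above, would be needed.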