Abstract: Code-switching (CS) is the practice of speakers alternating between two or more languages. The scarcity of CS data drives the development of advanced Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) models that incorporate ideas from linguistic theory. To describe CS, the Matrix Language Frame (MLF) theory defines the Matrix Language (ML) as the language that provides the grammatical structure of a CS sentence. System morphemes are the morphemes that contribute to a sentence's grammatical structure, and the MLF theory uses them for ML detection. Discovering system morphemes can improve automatic CS text generation and analysis. This work introduces several novel approaches for discovering system morphemes based on the MLF theory. Deterministic and predictive implementations of the System Morpheme Principle (SMP) are used to discover system words, which approximate system morphemes, through the tasks of ML determination and prediction on the SEAME and Miami datasets. The outputs of the two SMP implementations are compared with the outputs of the Morpheme Order Principle (MOP). The deterministic SMP approach shows that the conventional system words (pronouns, conjunctions, determiners, auxiliaries) fall within the top 50% of most frequent words when averaged over Part of Speech (POS). The deterministic SMP also yields a ranking of POS categories with respect to the ML determination task, which highlights the importance of particles and adpositions. Using the extracted system words in a CS text simulation task leads to a total fluency improvement of 3.3%, demonstrating the advantages of the statistical analysis of the linguistic properties of data in the deterministic SMP. This study provides valuable insight into the properties of tokens in relation to their grammatical categories in CS data.
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: linguistic theories, code-switching, grammar and knowledge-based approaches
Contribution Types: Approaches to low-resource settings, Data analysis, Theory
Languages Studied: English, Spanish, Mandarin
Submission Number: 2438