Diversifying language models for lesser-studied languages and language-usage contexts: A case of second language Korean

Published: 07 Oct 2023, Last Modified: 01 Dec 2023
Venue: EMNLP 2023 Findings
Submission Type: Regular Long Paper
Submission Track: Multilinguality and Linguistic Diversity
Submission Track 2: Phonology, Morphology, and Word Segmentation
Keywords: Multilinguality, DEI, NLP applications, L2 Korean, Morpheme parsing/tagging
TL;DR: This study explores the applicability of existing morpheme parsers/taggers to lesser-studied languages and contexts (L2 Korean) by training a neural-network model on diverse L2 Korean data and evaluating its parsing/tagging performance.
Abstract: This study investigates the extent to which currently available morpheme parsers/taggers apply to lesser-studied languages and language-usage contexts, with a focus on second language (L2) Korean. We pursue this inquiry by (1) training a neural-network model (pre-trained on first language [L1] Korean data) on varying L2 datasets and (2) measuring its morpheme parsing/POS tagging performance on L2 test sets drawn both from the same sources as the L2 training sets and from different ones. Results show that the L2-trained models generally outperform the L1 pre-trained baseline model in domain-specific tokenization and POS tagging. Interestingly, increasing the size of the L2 training data does not consistently improve model performance.
Submission Number: 787