Han2Han: A Denoising Autoencoder for Improved Korean Language Modeling and a Use Case in Tracing the Discursive Formation of Modern Korean Art
Abstract: When applying natural language processing to Korean historical texts, corpora commonly mix texts written in several scripts, which standard language models handle poorly. To address this, we introduce a sequence-to-sequence denoising pre-training method with enhanced word embeddings designed to model the characteristics of Korean etymology and written scripts. Our results show a significant increase in performance on almost all Korean Language Understanding Evaluation (KLUE - NeurIPS 2021) tasks, suggesting that the representations created by language models benefit from learning language-specific information. We then apply our method to a use case tracing discursive change in twentieth-century Korean art criticism, enabling the modeling of diachronic semantic change in historical Korean texts.
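To make the general idea of denoising pre-training on mixed-script Korean concrete, the sketch below shows one way to build (noisy, clean) training pairs by randomly masking Hangul and Hanja characters. This is a minimal illustration only, not the authors' Han2Han method: the masking scheme, the `[MASK]` token, and the `add_noise` helper are assumptions introduced here for exposition.

```python
# Illustrative sketch: constructing (noisy, clean) pairs for seq2seq denoising
# pre-training on mixed-script Korean text. NOT the paper's actual noising scheme;
# the mask token and masking probability are hypothetical choices.
import random

MASK = "[MASK]"

def is_hanja(ch: str) -> bool:
    # CJK Unified Ideographs block, which covers most Hanja characters.
    return "\u4e00" <= ch <= "\u9fff"

def is_hangul(ch: str) -> bool:
    # Precomposed Hangul syllables block.
    return "\uac00" <= ch <= "\ud7a3"

def add_noise(text: str, mask_prob: float = 0.15, seed: int = 0) -> str:
    """Randomly mask Hangul/Hanja characters, leaving other symbols intact."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if (is_hangul(ch) or is_hanja(ch)) and rng.random() < mask_prob:
            out.append(MASK)
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    clean = "美術은 근대 한국에서 새로운 개념이었다."  # mixed Hanja/Hangul sentence
    noisy = add_noise(clean, mask_prob=0.3)
    # A seq2seq denoising autoencoder would be trained to reconstruct `clean` from `noisy`.
    print("noisy:", noisy)
    print("clean:", clean)
```

In such a setup, the encoder reads the corrupted mixed-script input and the decoder is trained to reproduce the original sentence, encouraging the model to learn correspondences between the scripts.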
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: topic modeling, sociolinguistics, historical NLP, mixed language, domain adaptation, word embeddings, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches to low compute settings - efficiency, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: Korean
Submission Number: 483