Han2Han: A Denoising Autoencoder for Improved Korean Language Modeling and a Use Case in Tracing the Discursive Formation of Modern Korean Art

ACL ARR 2024 August Submission 483 Authors

16 Aug 2024 (modified: 06 Sept 2024) · ACL ARR 2024 August Submission · CC BY 4.0
Abstract: When applying natural language processing to Korean historical texts, corpora commonly mix scripts from several languages, which degrades model compatibility. To address this, we introduce a sequence-to-sequence denoising pre-training method with enhanced word embeddings designed to model the characteristics of Korean etymology and written scripts. Our results show a significant performance increase on almost all Korean Language Understanding Evaluation (KLUE - NeurIPS 2021) tasks, suggesting that the representations learned by language models benefit from language-specific information. We then apply our method in a use case tracking discursive changes in 20th-century Korean art-critical texts, enabling the modeling of diachronic semantic change in historical Korean.
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: topic modeling, sociolinguistics, historical NLP, mixed language, domain adaptation, word embeddings, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Approaches to low-compute settings / efficiency, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: Korean
Submission Number: 483
