HERIT: Democratizing Global Access to Korean Historical Archives via RAG-based Data Augmentation

ACL ARR 2026 January Submission5360 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, Readers: Everyone, License: CC BY 4.0

Keywords: Machine Translation, Large Language Model, Historical Documents, RAG, Data Augmentation

Abstract: Historical archives are invaluable resources for multidisciplinary research but remain inaccessible to the global community due to language barriers. Manual translation is prohibitively expensive and time-consuming, and existing machine translation models are often inadequate when applied directly because of the unique linguistic and historical nuances of these documents. To address these challenges, we propose HERIT, a framework that leverages Retrieval-Augmented Generation (RAG) to generate high-quality pseudo-labeled data from abundant Hanja-Korean corpora. This approach expands the training dataset, mitigating the data scarcity and temporal overfitting observed in human-labeled corpora. Extensive evaluations demonstrate that HERIT significantly outperforms baseline models. Finally, we employ our model to translate previously untranslated portions of the archives, aiming to democratize access to these resources for researchers worldwide.

Paper Type: Long

Research Area: Machine Translation

Research Area Keywords: few-shot/zero-shot MT, retrieval-augmented generation, fine-tuning, datasets for low resource languages, less-resourced languages, historical NLP

Contribution Types: Publicly available software and/or pre-trained models, Data resources

Languages Studied: English, Hanja, Korean

Submission Number: 5360
