Keywords: Machine Translation, Large Language Model, Historical Documents, RAG, Data Augmentation
Abstract: Historical archives are invaluable resources for multidisciplinary research but remain inaccessible to the global community due to language barriers. Manual translation is prohibitively expensive and time-consuming, and existing machine translation models applied directly are often inadequate because of the unique linguistic and historical nuances of these documents. To address these challenges, we propose HERIT, a novel framework that leverages Retrieval-Augmented Generation (RAG) to generate high-quality pseudo-labeled data from abundant Hanja-Korean corpora. This approach expands the training dataset, mitigating the data scarcity and temporal overfitting observed in human-labeled corpora. Extensive evaluations demonstrate that HERIT significantly outperforms baseline models. Finally, we employ our model to translate previously untranslated portions of the archives, aiming to democratize access to these resources for researchers worldwide.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: few-shot/zero-shot MT, retrieval-augmented generation, fine-tuning, datasets for low resource languages, less-resourced languages, historical NLP
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Hanja, Korean
Submission Number: 5360