Augmenting Classical Chinese - Modern Vietnamese Couplet Parallel Corpus Using LLM-Assisted Methods

The-Anh Hoang, Nhat-Hung Dang-Hoang, Long Nguyen, Dien Dinh

Published: 01 Jan 2026, Last Modified: 14 Dec 2025, Crossref, CC BY-SA 4.0
Abstract: Classical Chinese couplets represent a significant literary genre, yet their complex script system and the limited availability of resources make them difficult to process computationally. This research addresses the resulting data scarcity by exploring and evaluating data augmentation techniques that leverage large language models (LLMs) such as TongGu and Llama3. We collected and processed a Classical Chinese - Modern Vietnamese parallel corpus comprising over 46,000 lines and applied three augmentation strategies: Classical-Modern back-translation, character variant substitution, and masked language model-based augmentation. We performed a preliminary evaluation using the Moses statistical machine translation (SMT) system and OpenNMT, an open-source neural machine translation (NMT) system. Our findings highlight the potential of combining all three augmentation methods, providing a foundation for future research on more robust Natural Language Processing (NLP) tools for this under-resourced domain, such as machine translation and named entity recognition.
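Of the three strategies named in the abstract, character variant substitution is the simplest to illustrate. The following is a minimal sketch, not the authors' implementation: the `VARIANTS` table, the function names, and the `prob` parameter are all hypothetical, and the handful of variant pairs shown are illustrative examples rather than the variant list used in the paper. The idea is to replace characters on the Classical Chinese side with alternate written forms while leaving the Vietnamese side unchanged.

```python
import random

# Hypothetical variant table (assumption): each character maps to alternate
# written forms. These pairs are illustrative, not the paper's actual list.
VARIANTS = {
    "為": ["爲"],
    "真": ["眞"],
    "峰": ["峯"],
    "群": ["羣"],
}


def substitute_variants(line, variants=VARIANTS, prob=1.0, rng=None):
    """Return a copy of `line` with known characters swapped for a variant.

    Each character that has an entry in `variants` is replaced with
    probability `prob`; all other characters are kept as they are.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are reproducible
    out = []
    for ch in line:
        options = variants.get(ch)
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


def augment_pair(zh_line, vi_line, **kwargs):
    """Build an augmented (source, target) pair from one parallel line.

    The Vietnamese side is reused unchanged. An empty list is returned
    when no character was substituted, to avoid adding exact duplicates.
    """
    new_zh = substitute_variants(zh_line, **kwargs)
    return [(new_zh, vi_line)] if new_zh != zh_line else []
```

Because only the source side changes and the line length is preserved, each augmented pair stays aligned with its original translation. Lowering `prob` below 1.0 would yield partially substituted lines, which is one possible way to produce more than one new pair per original line.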