Abstract: Classical Chinese couplets are a significant literary genre, yet their complex script system and the limited availability of resources make them hard to process computationally. This research addresses the problem of data scarcity by exploring and evaluating data augmentation techniques that leverage large language models (LLMs) such as TongGu and Llama3. We collected and processed a Classical Chinese - Modern Vietnamese parallel corpus of over 46,000 lines and applied three augmentation strategies: Classical-Modern back-translation, character variant substitution, and masked language model-based augmentation. We performed a preliminary evaluation using the Moses statistical machine translation (SMT) system and OpenNMT, an open-source neural machine translation (NMT) system. Our findings highlight the potential of combining all three augmentation methods and provide a foundation for future research on more robust Natural Language Processing (NLP) tools, such as machine translation and named entity recognition, for this under-resourced domain.
External IDs: doi:10.1007/978-3-032-10202-7_27