Towards Better Translations from Classical to Modern Chinese: A New Dataset and a New Method

Published: 01 Jan 2023, Last Modified: 04 Apr 2025, Venue: NLPCC (1) 2023, License: CC BY-SA 4.0
Abstract: Classical Chinese (Ancient Chinese) is the written language that was used in ancient China, and it has been an important carrier of Chinese culture for thousands of years. Numerous ideas in modern disciplines, including mathematics, medicine, and engineering, have been influenced by or derived from it, which demonstrates the need to understand, inherit, and disseminate it. Consequently, there is an urgent need to develop neural machine translation to facilitate the comprehension of classical Chinese sentences. In this paper, we introduce a high-quality and comprehensive dataset called C2MChn, consisting of about 615K sentence pairs for translation between classical and modern Chinese. To the best of our knowledge, this is the first dataset covering a wide range of domains, including history books, Buddhist classics, and Confucian classics. Furthermore, based on an analysis of classical and modern Chinese, we propose a simple yet effective method named Syntax-Semantics Awareness Transformer (SSAT). It is capable of leveraging both syntactic and semantic information, which are indispensable for better translating classical Chinese. Experiments show that our model achieves better BLEU scores than several state-of-the-art methods as well as two general translation engines, the Microsoft and Baidu APIs. The dataset and related resources will be released at: https://github.com/Zongyuan-Jiang/C2MChn.