Abstract: Little research has examined the validity of datasets for translating Chinese Patent Medicine instructions into English. Analyzing the Chinese and English texts generated by prominent translation engines, we observe that the readability of the translations is a persistent weakness and that the English translation standards are inconsistent. Few internet search platforms are designed specifically for Chinese Patent Medicine (CPM), and those that exist focus on specialized terminology related to Chinese herbal medicine. To address these problems, we first develop a Chinese Patent Medicine Instruction Dataset (CPMID) for Chinese-English translation. The dataset comprises 11,695 meticulously annotated and validated Chinese-English entries. We benchmark the task by training and testing multiple baselines, including the traditional models Seq2Seq+Attention (LSTM) and Transformer, and the pre-trained, released translation models SMaLL-100, NLLB-200, mBART-50, and ChatGPT. The results demonstrate the accuracy and effectiveness of the dataset, with an improvement of 42.5 BLEU that surpasses the prior state of the art by over 54.7%. The primary objective of using this dataset in future R&D is to provide a reliable retrieval system for foreign users of CPM. We believe that CPMID has the potential to facilitate the modernization of Traditional Chinese Medicine (TCM) and to contribute significantly to the field of Modern Medicine (MM).