Optimizing Chinese Lexical Simplification Across Word Types: A Hybrid Approach

ACL ARR 2024 June Submission1914 Authors

15 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: This paper addresses the task of Chinese lexical simplification (CLS), which aims to replace complex words in a given sentence with simpler alternatives that preserve the original meaning. One of the challenges in CLS is the scarcity of data resources: previous unsupervised methods exhibit limited performance, while supervised methods struggle because of the lack of annotated data. We begin by evaluating the few-shot performance of several dialogue models at various scales on CLS, and find that their effectiveness is sensitive to word type. Large but expensive Large Language Models (LLMs), such as GPT-4, excel at simplifying in-dictionary common words and Chinese idioms compared to smaller models. We therefore propose an automatic knowledge distillation approach that uses GPT-4 to generate training data for common words and Chinese idioms, and then fine-tunes smaller models on this data in a unified but word-type-aware manner. However, even GPT-4 encounters difficulties with out-of-dictionary (OOD) words. To address this, we employ a retrieval-based interpretation augmentation strategy that injects relevant information from external sources into the context. Experimental results show that the fine-tuned small models outperform GPT-4 at simplifying common words and idioms, optimizing the balance between CLS performance and computational cost, and that the interpretation augmentation strategy improves the performance of most models on OOD words.
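As a rough illustration of the retrieval-based interpretation augmentation described in the abstract, the sketch below retrieves a gloss for an out-of-dictionary word from an external source and injects it into a simplification prompt. The function names, prompt wording, and `lookup_gloss` retriever are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of retrieval-based interpretation augmentation for CLS.
# The retriever and prompt format are illustrative assumptions, not the paper's code.
from typing import Optional


def lookup_gloss(word: str, external_sources: dict) -> Optional[str]:
    """Look up an interpretation of an OOD word from an external source (assumed interface)."""
    return external_sources.get(word)


def build_cls_prompt(sentence: str, complex_word: str, gloss: Optional[str]) -> str:
    """Compose a prompt asking a model for simpler substitutes, optionally
    injecting the retrieved interpretation as extra context."""
    prompt = (
        f"Sentence: {sentence}\n"
        f"Complex word: {complex_word}\n"
    )
    if gloss:  # interpretation augmentation: add retrieved meaning for OOD words
        prompt += f"Interpretation of '{complex_word}': {gloss}\n"
    prompt += "Provide simpler substitutes that preserve the sentence's meaning."
    return prompt


# Toy usage example; the gloss below is made up for illustration.
external_sources = {"鲲鹏": "a mythical giant fish-bird, used figuratively for something of vast scale"}
print(build_cls_prompt("鲲鹏展翅，志在千里。", "鲲鹏", lookup_gloss("鲲鹏", external_sources)))
```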
Paper Type: Long
Research Area: Semantics: Lexical and Sentence-Level
Research Area Keywords: Large Language Models, Knowledge Distillation, Retrieval Augmentation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: Chinese
Submission Number: 1914