Abstract: Previous research has demonstrated that fine-tuning with legal data can significantly enhance the performance of Large Language Models (LLMs) on legal question answering (Q&A), affirming that data augmentation is an effective strategy. However, the scarcity of high-quality datasets remains a major challenge for legal LLMs, and how to fine-tune general-purpose LLMs for legal Q&A tasks in natural language processing is still underexplored. To address these issues, we introduce a novel methodology named “Continuous Semantic Augmentation Fine-Tuning” (CSAFT). First, we semantically expand the “question” part of each legal Q&A pair. Second, we select sentences from the nearby semantic space to serve as new questions, while preserving the content of the “answer” part. Finally, we fine-tune the model on this newly generated dataset. In contrast to traditional fine-tuning strategies, CSAFT requires only a minimal amount of original data: it expands the “question” in a continuous semantic space while maintaining the accuracy of the “answer”, ultimately generating diverse and reliable training samples that alleviate data scarcity. We conducted experiments on 5 datasets and 15 models. The results show a noticeable improvement in the performance of LLMs on legal Q&A after CSAFT. Furthermore, human expert evaluations indicate that the overall score is comparable to, or even better than, that of state-of-the-art Chinese legal models that have undergone extensive pre-training and fine-tuning on discrete data.
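As a rough illustration of the augmentation step outlined above, the sketch below generates candidate questions, keeps only those that stay close to the original question in a sentence-embedding space, and pairs each kept question with the unchanged answer. This is a minimal sketch under assumed components: the `propose_questions` paraphraser, the `embed` encoder, and the 0.85 similarity threshold are hypothetical placeholders, not the authors' actual implementation.

```python
# Minimal sketch of the CSAFT-style question augmentation described in the abstract.
# The paraphrase generator and sentence encoder are assumed, pluggable components.

from dataclasses import dataclass
from typing import Callable, List, Sequence
import math


@dataclass
class QAPair:
    question: str
    answer: str


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def augment(
    pair: QAPair,
    propose_questions: Callable[[str], List[str]],  # hypothetical paraphraser (e.g. an LLM)
    embed: Callable[[str], Sequence[float]],        # hypothetical sentence encoder
    min_sim: float = 0.85,                          # assumed closeness threshold
) -> List[QAPair]:
    """Keep candidate questions that lie near the original question in the
    semantic (embedding) space, and pair each with the unchanged answer."""
    anchor = embed(pair.question)
    kept = [
        q
        for q in propose_questions(pair.question)
        if cosine(embed(q), anchor) >= min_sim
    ]
    return [QAPair(question=q, answer=pair.answer) for q in kept]
```

The augmented pairs produced this way would then serve as additional fine-tuning samples; the actual selection procedure in the paper may differ.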