Medical Text Classification with Data Augmentation Based on Baichuan2

Published: 01 Jan 2024 · Last Modified: 19 May 2025 · CLSW (2) 2024 · CC BY-SA 4.0
Abstract: Text classification is one of the most common tasks in natural language processing, and the pretrain-then-fine-tune paradigm currently achieves outstanding results on it. However, fine-tuning on downstream tasks still requires an ample amount of training data. When manually annotated training data is insufficient, a common solution is data augmentation through back translation with machine translation systems. We generate sentence embeddings using pre-trained models and apply a similarity-based filtering method to automatically screen back-translated samples, thereby improving back-translation quality. By further applying LoRA fine-tuning to the large language model Baichuan2, we improve text classification accuracy. Compared with a traditional BERT model, our approach achieves a 0.5%–1.0% accuracy improvement on CHIP 2023 Task 6, diabetes question classification.
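The similarity-based screening of back-translated samples can be illustrated with a minimal sketch: embed each original sentence and its back translation with a pre-trained sentence encoder, then keep only pairs whose cosine similarity exceeds a threshold. The encoder name and the threshold value below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of similarity-based filtering for back-translated samples.
# The encoder checkpoint and the 0.85 threshold are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util


def filter_back_translations(originals, back_translations, threshold=0.85):
    """Keep back-translated sentences whose embedding is sufficiently
    similar to the original sentence, discarding low-quality translations."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb_orig = model.encode(originals, convert_to_tensor=True)
    emb_bt = model.encode(back_translations, convert_to_tensor=True)

    kept = []
    for i, (orig, bt) in enumerate(zip(originals, back_translations)):
        score = util.cos_sim(emb_orig[i], emb_bt[i]).item()
        if score >= threshold:
            kept.append((orig, bt, score))
    return kept
```

Pairs that pass the filter can then be added to the training set before LoRA fine-tuning; the threshold trades off augmentation volume against translation fidelity.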