Leveraging Web-Crawled Data for High-Quality Fine-Tuning

ACL ARR 2024 June Submission1512 Authors

14 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, which cannot guarantee performance in certain domains. We argue that although web-crawled data often has formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we create a paired training dataset by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert web data with irregular formats into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average of 9.4% on Chinese elementary school math problems. Additionally, our 7B model outperforms several open-source models larger than 30B and surpasses well-known closed-source models such as GPT-3.5 and Claude-2, highlighting the efficacy of our approach.
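The core idea of the abstract is to pair noisy web-crawled examples with clean counterparts and fine-tune a model on those pairs so it learns to rewrite irregular web data into high-quality form. The sketch below is not the authors' code: it assumes alignment is done by simple text similarity and uses hypothetical file and field names, purely to illustrate what such a paired dataset could look like.

```python
# Minimal sketch (assumptions, not the paper's pipeline): align noisy web-crawled
# math problems with the most similar high-quality reference, producing
# (input, output) pairs for supervised fine-tuning of a rewriting model.
import json
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough character-level similarity; a real pipeline might use embeddings."""
    return SequenceMatcher(None, a, b).ratio()


def build_pairs(web_items, clean_items, threshold=0.6):
    """Pair each web-crawled item with its closest high-quality item, if close enough."""
    pairs = []
    for noisy in web_items:
        best = max(clean_items, key=lambda c: similarity(noisy, c))
        if similarity(noisy, best) >= threshold:
            pairs.append({"input": noisy, "output": best})
    return pairs


if __name__ == "__main__":
    web_items = ["题目:小明 有3个苹果,,又买 了2个.问共几个?"]    # noisy, mis-formatted web text
    clean_items = ["小明有3个苹果，又买了2个，一共有几个苹果？"]  # high-quality reference
    with open("paired_sft_data.jsonl", "w", encoding="utf-8") as f:
        for pair in build_pairs(web_items, clean_items):
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

The resulting JSONL could then be used to fine-tune a language model as a web-data "cleaner", which in turn transforms the remaining web corpus before the final fine-tuning run described in the abstract.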
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: language modeling, fine-tuning
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Chinese
Submission Number: 1512