Leveraging Web-Crawled Data for High-Quality Fine-Tuning

ACL ARR 2024 June Submission1512 Authors

14 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, which cannot guarantee performance in certain domains. We argue that although web-crawled data often has formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we create a paired training dataset by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert web data with irregular formats into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average of 9.4% on Chinese elementary school math problems. Additionally, our 7B model outperforms several open-source models larger than 30B and surpasses well-known closed-source models such as GPT-3.5 and Claude-2, highlighting the efficacy of our approach.
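The core idea of the abstract is to pair noisy web-crawled examples with clean counterparts and fine-tune a model on those pairs so it learns to rewrite irregular web data into high-quality form. The sketch below is not the authors' code: it assumes alignment is done by simple text similarity and uses hypothetical file and field names, purely to illustrate what such a paired dataset could look like.

```python
# Minimal sketch (assumptions, not the paper's pipeline): align noisy web-crawled
# math problems with the most similar high-quality reference, producing
# (input, output) pairs for supervised fine-tuning of a rewriting model.
import json
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Rough character-level similarity; a real pipeline might use embeddings."""
    return SequenceMatcher(None, a, b).ratio()


def build_pairs(web_items, clean_items, threshold=0.6):
    """Pair each web-crawled item with its closest high-quality item, if close enough."""
    pairs = []
    for noisy in web_items:
        best = max(clean_items, key=lambda c: similarity(noisy, c))
        if similarity(noisy, best) >= threshold:
            pairs.append({"input": noisy, "output": best})
    return pairs


if __name__ == "__main__":
    web_items = ["题目:小明 有3个苹果,,又买 了2个.问共几个?"]    # noisy, mis-formatted web text
    clean_items = ["小明有3个苹果，又买了2个，一共有几个苹果？"]  # high-quality reference
    with open("paired_sft_data.jsonl", "w", encoding="utf-8") as f:
        for pair in build_pairs(web_items, clean_items):
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

The resulting JSONL could then be used to fine-tune a language model as a web-data "cleaner", which in turn transforms the remaining web corpus before the final fine-tuning run described in the abstract.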
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: language modeling, fine-tuning
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Chinese
Submission Number: 1512