The Art of Data Selection: A Survey on Data Selection for Fine-Tuning Large Language Models

ACL ARR 2024 April Submission 741 Authors

16 Apr 2024 (modified: 02 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have recently seen significant advancements, and supervised fine-tuning (SFT) plays a pivotal role in unlocking LLMs' potential to follow user instructions. As an emerging research field, data selection for fine-tuning LLMs aims to choose a subset of a candidate dataset for training, so that the resulting selection-enhanced models achieve better performance with faster training. Although a number of studies have investigated such methods, a comprehensive analysis and comparison of them that could point to promising research directions is still lacking. To fill this gap, we first summarize existing work under a three-step scheme for data selection, consisting of data preprocessing, data selector construction, and data selector evaluation, and comprehensively organize the literature according to this scheme. We then analyze existing methods in depth with respect to their efficiency and feasibility through quantitative and qualitative comparisons, and find that (1) model-specific methods, which take the loss of the model to be fine-tuned as the optimization objective, are more effective; and (2) increasing the complexity of the selector can improve the performance of the selection-enhanced model, but requires careful design to avoid introducing confounding external factors. Finally, we summarize trends in data selection and point out that the main current challenges are the lack of a unified and efficient measure of data quality, as well as data selection for specific tasks and multi-turn conversations.
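To make the "model-specific" idea from the abstract concrete, below is a minimal illustrative sketch of loss-based data selection: each candidate example is scored by the loss of the model to be fine-tuned, and a fixed-size subset is kept. The model name, the scoring direction, and the `budget` parameter are assumptions for illustration only; the surveyed methods differ in how they derive and use such loss signals, and this is not the paper's specific algorithm.

```python
# Illustrative sketch of model-specific data selection: rank candidate
# training examples by the loss of the model to be fine-tuned, then
# keep a fixed-size subset. Model choice and selection direction are
# placeholders, not prescriptions from the survey.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

def example_loss(text: str) -> float:
    """Average token-level cross-entropy of the model on one example."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

def select_subset(candidates: list[str], budget: int) -> list[str]:
    """Rank candidates by model loss and keep `budget` examples.

    Whether low- or high-loss examples are preferred differs across
    the surveyed methods; here we keep the highest-loss ones as a
    stand-in for "most informative for this model".
    """
    scored = sorted(candidates, key=example_loss, reverse=True)
    return scored[:budget]
```

In practice, such loss scores are often normalized or combined with other signals (e.g., instruction difficulty or diversity) before subset selection; the sketch above only shows the shared core of using the fine-tune model's own loss as the scoring function.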
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Data selection, Survey, Large Language Model, Fine-tuning
Contribution Types: Surveys
Languages Studied: English
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 741