Keywords: supervised fine-tuning, data-efficient training, data efficiency, data selection
Abstract: LLM supervised fine-tuning (SFT) data have become abundant, driven largely by synthetic generation. However, sheer scale does not guarantee effective SFT: large datasets often exhibit pathologies such as redundancy, imbalance, and poor learnability, making data selection critical for improving SFT data efficiency and efficacy. Existing selection methods typically rely on static or handcrafted criteria, which can be subjective, biased, and lack transferability. To address this, we propose **LeRS** (**Le**arner Self-**R**ectify **S**election), a learner-centric data selection framework. It enables the learner model, or a compact homologue of it, to identify learning-worthy samples via self-feedback signals and to dynamically rectify the training subset distribution throughout the learning process. By focusing on the learner's evolving needs rather than static metrics, LeRS surfaces suitably challenging and unmastered samples that are otherwise overlooked. Experiments show that LeRS boosts data efficiency and efficacy: using only $10\%$ of the data, it matches the SFT performance of $5\times$ randomly sampled data, and it yields consistent gains over full-dataset training in multi-source scenarios. Our findings reveal that prioritizing high-utility data that dynamically addresses the learner's needs is the key to "more with less" in SFT.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: data-efficient training
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 4903