Keywords: data efficacy; language models
Abstract: Data is fundamental to the training of language models (LMs).
Recent research has focused on data efficiency, aiming to reduce data scale without compromising model performance.
However, **data efficacy**, which emphasizes improving model performance by optimizing the utilization of training data, remains underexplored.
To enhance data efficacy, we propose novel methods for both data ordering and data scoring.
For data ordering, we design *Folding Ordering (FO)*, which addresses the data distribution bias and model forgetting introduced by traditional curriculum learning.
For data scoring, we present *Learnability-Quality Scoring (LQS)*, the first method specifically designed to support both data ordering and selection.
To further establish the foundation for data efficacy, we introduce a general paradigm, **DELT** (**D**ata **E**fficacy for **L**M **T**raining), which underscores the importance of training data utilization. It comprises two essential modules, data scoring and data ordering, along with an optional data selection module.
Together, these modules enable DELT to improve data efficacy as well as data efficiency.
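Since this abstract does not specify the internals of LQS or FO, the following is only a minimal, hypothetical sketch of the DELT pipeline shape (score → optional selection → ordering). The function name `delt_pipeline`, the generic per-sample scorer `score_fn`, the `keep_ratio` and `num_folds` parameters, and the strided reading of "folding" are all assumptions for illustration, not the paper's confirmed method.

```python
from typing import Callable, List, Sequence, Tuple

def delt_pipeline(
    dataset: Sequence[str],
    score_fn: Callable[[str], float],   # placeholder for an LQS-style learnability-quality scorer
    keep_ratio: float = 1.0,            # optional data selection: fraction of top-scored samples to keep
    num_folds: int = 4,                 # number of folds for a folding-style ordering (assumption)
) -> List[str]:
    """Hypothetical DELT-shaped pipeline: score -> (optional) select -> order.

    The folding step here is an assumption: samples are sorted by score and
    emitted as `num_folds` strided sweeps, so each fold spans the full score
    range. Repeated full-range sweeps are one plausible way an ordering scheme
    could counter the distribution bias and forgetting attributed to a single
    easy-to-hard curriculum.
    """
    # 1) Data scoring: attach a scalar score to every sample.
    scored: List[Tuple[float, str]] = [(score_fn(x), x) for x in dataset]

    # 2) Optional data selection: keep the top-scoring fraction.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_ratio))]

    # 3) Data ordering: sort ascending by score, then stride into folds,
    #    so each fold is a low-to-high sweep over the whole score range.
    kept.sort(key=lambda pair: pair[0])
    ordered: List[str] = []
    for offset in range(num_folds):
        ordered.extend(x for _, x in kept[offset::num_folds])
    return ordered
```

As a toy usage example, `delt_pipeline(corpus, score_fn=len, keep_ratio=0.5, num_folds=4)` would keep the longer half of a string corpus and emit it as four low-to-high sweeps; in practice the scorer would be a model-based, LQS-style function rather than `len`.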
Comprehensive experiments validate our approach, demonstrating that FO and LQS significantly improve LM performance across various settings, consistently surpassing existing baselines.
We believe that data efficacy, which aims to fully harness the value of data to benefit model performance without altering data scale or model size, is a promising foundational area of LM training.
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 2043