Abstract: Recent advances in foundational large language models (LLMs) have challenged the widely recognized scaling laws, chiefly by prompting a reinterpretation of the relationship between model scale, data scale, and model capabilities. This paper proposes a novel research perspective that treats the model's holistic weights as a system variable. By applying a preliminary, subtle scaling to the model during supervised fine-tuning (SFT), a method we refer to as pre-scaling, we systematically investigate the relationship between performance evolution and model variation. Building on this approach, we conduct extensive experiments across various pre-trained language models (PLMs), revealing discrete features of the model: loss particles and output particles. Through empirical investigation and theoretical analysis, we characterize the fundamental process and statistical properties of particle fission during SFT. Drawing on the inherent properties of output particles, we establish a coupling relationship between these particles and sample importance. Based on this insight, we propose a simple and efficient data selection method named Pre-Scaling Pruning (PSP), which comprises two strategies: $\mathrm{PSP_{one-shot}}$ and $\mathrm{PSP_{zero-shot}}$. Notably, at a pruning ratio of 50%, the data subset selected by $\mathrm{PSP_{one-shot}}$ achieves a higher average GLUE score than the full dataset, demonstrating that high-quality data subsets can not only reduce computational overhead but also enhance the model's generalization capability.
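The abstract does not spell out the PSP procedure, so the following is only a minimal, hypothetical Python (PyTorch) sketch of what a pre-scaling-based selection score could look like: the model's weights are multiplied by a small factor, each sample's output shift under that perturbation is measured, and the highest-scoring fraction of the data is kept. The function name `psp_select`, the scaling factor, the KL-divergence score, and the data interface are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch of pre-scaling-based data selection (not the paper's code).
import copy
import torch
import torch.nn.functional as F


def psp_select(model, dataset, keep_ratio=0.5, scale=1.01, device="cpu"):
    """Return indices of samples retained after a PSP-style pruning step."""
    model = model.to(device).eval()

    # Pre-scaled copy: every weight multiplied by a small, subtle factor.
    scaled = copy.deepcopy(model)
    with torch.no_grad():
        for p in scaled.parameters():
            p.mul_(scale)

    scores = []
    with torch.no_grad():
        for inputs, _ in dataset:
            inputs = inputs.unsqueeze(0).to(device)
            base = F.log_softmax(model(inputs), dim=-1)
            shifted = F.log_softmax(scaled(inputs), dim=-1)
            # Score: divergence between original and pre-scaled outputs,
            # used here as a stand-in for the abstract's "output particle" signal.
            scores.append(
                F.kl_div(shifted, base, log_target=True, reduction="batchmean").item()
            )

    # Keep the samples whose outputs move the most under pre-scaling.
    k = int(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```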
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data-efficient training, generalization, data influence, scaling, fine-tuning
Languages Studied: English
Submission Number: 4098