Dual-lens: Model-Aware Data Curation for Efficient and Effective Knowledge Recovery in Pruned Language Models
Abstract: Recovering the capabilities of pruned language models typically requires fine-tuning on large datasets, yet often yields suboptimal results because the original pretraining data is unavailable for state-of-the-art foundation models. In this paper, we propose \textit{Dual-lens}, a data curation framework that identifies compact, high-utility subsets from public corpora. Dual-lens combines two criteria: \textit{CE-lens}, which targets samples the pruned model finds difficult, and \textit{SAE-lens}, which ensures semantic coverage via sparse autoencoders trained on latent concept distributions. By performing a pipelined fine-tuning procedure with the two lenses, the proposed framework balances model-specific correction and representational diversity. Experiments across various models, pruning schemes, and downstream tasks show that Dual-lens outperforms full-data tuning and recent baselines while using significantly less data: for example, LLaMA 2.1 13B pruned at a 35\% ratio achieves a 22\% accuracy improvement on downstream reasoning tasks using only 10\% of the full Alpaca corpus.
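To make the two criteria concrete, below is a minimal Python sketch of how the abstract's selection step might look, assuming a Hugging Face-style causal LM interface: CE-lens is approximated by per-sample cross-entropy under the pruned model, and SAE-lens by greedy coverage over a precomputed matrix of sparse-autoencoder feature activations. The function names (`ce_lens_scores`, `sae_lens_select`) and the greedy coverage objective are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def ce_lens_scores(pruned_model, tokenizer, texts, device="cuda"):
    """CE-lens (sketch): rank candidates by the pruned model's mean
    token cross-entropy; higher loss marks samples the model finds
    difficult and hence more useful for recovery fine-tuning."""
    pruned_model.eval()
    scores = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            # HF-style causal LMs return shifted next-token CE when labels=ids.
            scores.append(pruned_model(input_ids=ids, labels=ids).loss.item())
    return scores

def sae_lens_select(sae_acts, k):
    """SAE-lens (sketch): greedily pick k samples whose SAE feature
    activations together cover the most distinct latent concepts.
    sae_acts: (num_samples, num_features) nonnegative activations."""
    active = sae_acts > 0                       # which concepts each sample fires
    covered = torch.zeros(active.shape[1], dtype=torch.bool)
    chosen = []
    for _ in range(k):
        gains = (active & ~covered).sum(dim=1)  # newly covered concepts per sample
        gains[chosen] = -1                      # exclude already-selected samples
        best = int(gains.argmax())
        chosen.append(best)
        covered |= active[best]
    return chosen
```

In the pipelined setup the abstract describes, one plausible reading is to fine-tune first on a high-loss CE-lens subset and then on a coverage-maximizing SAE-lens subset; the ordering, subset sizes, and how the two subsets are combined are details the abstract leaves open.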
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: dataset curation, sparse autoencoders
Contribution Types: Approaches for low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 4261