Dual-lens: Model-Aware Data Curation for Efficient and Effective Knowledge Recovery in Pruned Language Models
Abstract: Recovering the capabilities of pruned language models typically requires fine-tuning on large datasets, yet often yields suboptimal results because the original pretraining data is unavailable for state-of-the-art foundation models. In this paper, we propose \textit{Dual-lens}, a data curation framework that identifies compact, high-utility subsets from public corpora. Dual-lens combines two criteria: \textit{CE-lens}, which targets samples the pruned model finds difficult, and \textit{SAE-lens}, which ensures semantic coverage via sparse autoencoders trained on latent concept distributions. By performing a pipelined fine-tuning procedure with the two lenses, the proposed framework balances model-specific correction and representational diversity. Experiments across various models, pruning schemes, and downstream tasks show that Dual-lens outperforms full-data tuning and recent baselines while using significantly less data: for example, LLaMA 2.1 13B pruned at a 35\% ratio achieves a 22\% accuracy improvement on downstream reasoning tasks using only 10\% of the full Alpaca corpus.
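To make the two criteria concrete, below is a minimal Python sketch of how the abstract's selection step might look, assuming a Hugging Face-style causal LM interface: CE-lens is approximated by per-sample cross-entropy under the pruned model, and SAE-lens by greedy coverage over a precomputed matrix of sparse-autoencoder feature activations. The function names (`ce_lens_scores`, `sae_lens_select`) and the greedy coverage objective are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def ce_lens_scores(pruned_model, tokenizer, texts, device="cuda"):
    """CE-lens (sketch): rank candidates by the pruned model's mean
    token cross-entropy; higher loss marks samples the model finds
    difficult and hence more useful for recovery fine-tuning."""
    pruned_model.eval()
    scores = []
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            # HF-style causal LMs return shifted next-token CE when labels=ids.
            scores.append(pruned_model(input_ids=ids, labels=ids).loss.item())
    return scores

def sae_lens_select(sae_acts, k):
    """SAE-lens (sketch): greedily pick k samples whose SAE feature
    activations together cover the most distinct latent concepts.
    sae_acts: (num_samples, num_features) nonnegative activations."""
    active = sae_acts > 0                       # which concepts each sample fires
    covered = torch.zeros(active.shape[1], dtype=torch.bool)
    chosen = []
    for _ in range(k):
        gains = (active & ~covered).sum(dim=1)  # newly covered concepts per sample
        gains[chosen] = -1                      # exclude already-selected samples
        best = int(gains.argmax())
        chosen.append(best)
        covered |= active[best]
    return chosen
```

In the pipelined setup the abstract describes, one plausible reading is to fine-tune first on a high-loss CE-lens subset and then on a coverage-maximizing SAE-lens subset; the ordering, subset sizes, and how the two subsets are combined are details the abstract leaves open.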
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: dataset curation, sparse autoencoders
Contribution Types: Approaches for low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 4261