PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

20 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Large Language Model, Model Pruning, Recovery Training, Data Selection
TL;DR: To achieve efficient and balanced capability recovery for pruned LLMs, we propose the PASER method for post-training data selection.
Abstract: Model pruning is an effective approach for compressing Large Language Models (LLMs) and improving inference efficiency. However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs due to extensive recovery training. Moreover, some instruction data irrelevant to model capability recovery may introduce negative effects. To address these challenges, we propose the **P**ost-training d**A**ta **S**election method for **E**fficient pruned large language model **R**ecovery (**PASER**). PASER aims to identify instructions where model capabilities are most severely compromised within a certain recovery data budget. Our approach first applies manifold learning and spectral clustering to group recovery data in the semantic space, revealing capability-specific instruction sets. We then adaptively allocate the data budget to different clusters based on the degrees of model capability degradation. In each cluster, we prioritize data samples where model performance has declined dramatically. To mitigate potential negative transfer, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4\%-20\% of the original post-training data and substantially reducing training computational overhead.
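The pipeline described above can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): instructions are clustered in a semantic embedding space, the data budget is split across clusters in proportion to the measured capability degradation, and the most-degraded samples are kept within each cluster. The inputs `embeddings` and `degradation` (per-sample loss increase of the pruned model over the original) are assumed to be precomputed, and the negative-transfer filtering step is omitted for brevity.

```python
# Illustrative PASER-style selection sketch; all names and inputs are hypothetical.
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

def select_recovery_data(embeddings, degradation, budget, n_clusters=8, seed=0):
    # Step 1: manifold learning + clustering to expose capability-specific groups
    # (spectral embedding followed by k-means as a stand-in for spectral clustering).
    low_dim = SpectralEmbedding(n_components=8, random_state=seed).fit_transform(embeddings)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(low_dim)

    # Step 2: allocate the budget to clusters in proportion to their mean degradation.
    cluster_scores = np.array([degradation[labels == c].mean() for c in range(n_clusters)])
    cluster_scores = np.clip(cluster_scores, a_min=0.0, a_max=None)
    total = cluster_scores.sum()
    weights = cluster_scores / total if total > 0 else np.full(n_clusters, 1.0 / n_clusters)

    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        quota = min(len(idx), int(round(budget * weights[c])))
        # Step 3: within each cluster, keep the samples whose performance dropped most.
        ranked = idx[np.argsort(-degradation[idx])]
        selected.extend(ranked[:quota].tolist())
    return selected

# Usage with random stand-in data.
rng = np.random.default_rng(0)
subset = select_recovery_data(rng.normal(size=(500, 64)), rng.random(500), budget=100)
print(len(subset), "samples selected")
```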
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2107