DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

ACL ARR 2026 January Submission 5404 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Pruning, Structured Pruning, Model Compression
Abstract: Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale leads to high computational and memory costs. Recent studies show that LLMs exhibit sparsity, which can be exploited for pruning. Existing pruning methods typically follow a prune-then-finetune paradigm. Since the pruned components still contain valuable information, removing them directly often causes irreversible performance degradation and necessitates expensive fine-tuning to recover performance. To address this, we propose a new paradigm: first apply regularization, then prune, and finally fine-tune. Based on this paradigm, we propose DReSS, a simple and effective **D**ata-driven **Re**gularized **S**tructured **S**treamlining method for LLMs. By using a small amount of data to regularize the components before pruning, DReSS transfers the important information to the remaining parts of the model in advance. Compared to direct pruning, this reduces the information loss caused by parameter removal, thereby preserving the model's language modeling capability. We evaluate our method on various LLMs, including Phi-2, OPT, LLaMA2, and LLaMA3. Experimental results demonstrate that DReSS, even without recovery fine-tuning (RFT), achieves performance comparable to previous methods, drastically reducing computational cost. Moreover, DReSS significantly outperforms strong existing pruning methods even under extreme pruning ratios, reducing latency and increasing throughput.
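The regularize-then-prune idea from the abstract can be illustrated with a toy least-squares model (a sketch, not the paper's actual method: the data, penalty strength, and training loop here are illustrative assumptions). A penalty on the to-be-pruned weight during a few calibration steps lets its information migrate into a correlated, kept weight, so zeroing it afterwards loses less than pruning the densely trained model directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy calibration data: feature 1 is strongly correlated with feature 0,
# so the contribution of the to-be-pruned column can migrate to the kept one.
n, d = 256, 2
x0 = rng.normal(size=(n, 1))
x1 = x0 + 0.05 * rng.normal(size=(n, 1))
X = np.hstack([x0, x1])
y = (X @ np.array([[1.0], [1.0]])).ravel() + 0.01 * rng.normal(size=n)

def fit(lam, steps=2000, lr=0.05, prune_idx=1):
    """Gradient descent on MSE plus an L2 penalty on the pruned weight."""
    w = np.zeros(d)
    for _ in range(steps):
        resid = X @ w - y
        grad = X.T @ resid / n
        grad[prune_idx] += lam * w[prune_idx]  # structured penalty on the target
        w -= lr * grad
    return w

mse = lambda w: float(np.mean((X @ w - y) ** 2))

# Prune-then-finetune baseline (without the finetune): train densely, zero directly.
w_direct = fit(lam=0.0)
w_direct[1] = 0.0

# Regularize-then-prune: penalize the target weight first, then remove it.
w_reg = fit(lam=10.0)
w_reg[1] = 0.0

print(f"direct pruning MSE:      {mse(w_direct):.4f}")
print(f"regularized pruning MSE: {mse(w_reg):.4f}")
```

In this setup the regularized-then-pruned model retains far lower error than direct pruning, mirroring the claim that pre-pruning regularization reduces the information lost at removal time.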
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 5404