Abstract: Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3\% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in preserving linguistic capabilities, showing a 5\% improvement over the baselines on language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating that multilingual capabilities are preserved.
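To make the abstract's "iterative pruning with lightweight recovery" idea concrete, the following is a minimal, hypothetical PyTorch sketch of a generic prune-then-recover loop on a toy block stack. The cosine-similarity block-importance score, the MSE-based distillation recovery, and all names (`ToyBlock`, `ToyModel`, `least_important_block`, `recover`) and hyperparameters are illustrative assumptions, not the paper's actual IteRABRe implementation.

```python
# Hypothetical sketch of an iterative prune-then-recover loop (not the authors' code).
import copy
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a transformer decoder block."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)  # residual connection, as in transformer blocks

class ToyModel(nn.Module):
    """A stack of blocks standing in for an LLM's decoder layers."""
    def __init__(self, dim: int = 64, n_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_blocks))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def least_important_block(model: ToyModel, calib: torch.Tensor) -> int:
    """Score each block by how little it changes its input (cosine similarity of
    block input vs. output); the most redundant block gets the highest score."""
    scores = []
    x = calib
    with torch.no_grad():
        for blk in model.blocks:
            y = blk(x)
            sim = torch.cosine_similarity(x.flatten(1), y.flatten(1), dim=1).mean()
            scores.append(sim.item())
            x = y
    return max(range(len(scores)), key=scores.__getitem__)

def recover(student: ToyModel, teacher: ToyModel, calib: torch.Tensor, steps: int = 50):
    """Brief recovery phase: match the unpruned teacher's outputs on a small set."""
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for _ in range(steps):
        with torch.no_grad():
            target = teacher(calib)
        loss = nn.functional.mse_loss(student(calib), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Iteratively drop one block, then recover, until the target depth is reached.
model = ToyModel()
calib = torch.randn(16, 64)            # stand-in for a small calibration/recovery set
target_blocks = 5
while len(model.blocks) > target_blocks:
    teacher = copy.deepcopy(model)     # snapshot before pruning, used as recovery target
    idx = least_important_block(model, calib)
    del model.blocks[idx]              # remove the most redundant block
    recover(model, teacher, calib)
print(f"Pruned to {len(model.blocks)} blocks")
```

The key design point this loop illustrates is that pruning and recovery alternate per block removal, so each recovery phase only needs to compensate for one small change, which is what keeps the recovery budget (here a handful of steps; in the paper, only 2.5M tokens) small.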
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, distillation, data-efficient training, multilingualism, probing
Contribution Types: Model analysis & interpretability, Approaches for low compute settings-efficiency
Languages Studied: English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese (Simplified), Hindi, Swahili, Urdu, Indonesian, Telugu, Basque, Burmese
Submission Number: 3366