Abstract: Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques. While block pruning offers a straightforward approach to reducing model size, existing methods often struggle to maintain performance or require substantial computational resources for recovery. We present IteRABRe, a simple yet effective iterative pruning method that achieves superior compression results while requiring minimal computational resources. Using only 2.5M tokens for recovery, our method outperforms baseline approaches by ~3\% on average when compressing the Llama3.1-8B and Qwen2.5-7B models. IteRABRe demonstrates particular strength in preserving linguistic capabilities, showing a 5\% improvement over the baselines on language-related tasks. Our analysis reveals distinct pruning characteristics between these models, while also demonstrating that multilingual capabilities are preserved.
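To make the abstract's "iterative pruning with lightweight recovery" idea concrete, the following is a minimal, hypothetical PyTorch sketch of a generic prune-then-recover loop on a toy block stack. The cosine-similarity block-importance score, the MSE-based distillation recovery, and all names (`ToyBlock`, `ToyModel`, `least_important_block`, `recover`) and hyperparameters are illustrative assumptions, not the paper's actual IteRABRe implementation.

```python
# Hypothetical sketch of an iterative prune-then-recover loop (not the authors' code).
import copy
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a transformer decoder block."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)  # residual connection, as in transformer blocks

class ToyModel(nn.Module):
    """A stack of blocks standing in for an LLM's decoder layers."""
    def __init__(self, dim: int = 64, n_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(n_blocks))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def least_important_block(model: ToyModel, calib: torch.Tensor) -> int:
    """Score each block by how little it changes its input (cosine similarity of
    block input vs. output); the most redundant block gets the highest score."""
    scores = []
    x = calib
    with torch.no_grad():
        for blk in model.blocks:
            y = blk(x)
            sim = torch.cosine_similarity(x.flatten(1), y.flatten(1), dim=1).mean()
            scores.append(sim.item())
            x = y
    return max(range(len(scores)), key=scores.__getitem__)

def recover(student: ToyModel, teacher: ToyModel, calib: torch.Tensor, steps: int = 50):
    """Brief recovery phase: match the unpruned teacher's outputs on a small set."""
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for _ in range(steps):
        with torch.no_grad():
            target = teacher(calib)
        loss = nn.functional.mse_loss(student(calib), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Iteratively drop one block, then recover, until the target depth is reached.
model = ToyModel()
calib = torch.randn(16, 64)            # stand-in for a small calibration/recovery set
target_blocks = 5
while len(model.blocks) > target_blocks:
    teacher = copy.deepcopy(model)     # snapshot before pruning, used as recovery target
    idx = least_important_block(model, calib)
    del model.blocks[idx]              # remove the most redundant block
    recover(model, teacher, calib)
print(f"Pruned to {len(model.blocks)} blocks")
```

The key design point this loop illustrates is that pruning and recovery alternate per block removal, so each recovery phase only needs to compensate for one small change, which is what keeps the recovery budget (here a handful of steps; in the paper, only 2.5M tokens) small.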
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, distillation, data-efficient training, multilingualism, probing
Contribution Types: Model analysis & interpretability, Approaches for low compute settings-efficiency
Languages Studied: English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese (Simplified), Hindi, Swahili, Urdu, Indonesian, Telugu, Basque, Burmese
Submission Number: 3366