Task-aware Block Pruning with Output Distribution Signals for Large Language Models

ACL ARR 2025 July Submission 1261 Authors

29 Jul 2025 (modified: 19 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large language models deliver excellent performance, but their practical deployment is limited by significant inference costs. Block pruning reduces latency while preserving structural coherence, yet existing methods typically rely on representation similarity or costly sensitivity analyses, neglecting task-specific model behavior. This paper introduces an output-driven pruning method that leverages entropy-based estimates of output distributions to accurately identify less important model blocks. Extensive experiments validate the proposed method's effectiveness, demonstrating substantial efficiency gains without compromising downstream task performance.
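
Note: the abstract only names the idea, so the following is a minimal, hypothetical sketch (not the authors' implementation) of what an entropy-based, output-distribution block-importance score could look like. Each block of a toy model is skipped in turn on a small calibration batch, and the absolute shift in the mean entropy of the output distribution serves as that block's importance; blocks with the smallest shifts are pruned. All names (ToyBlock, ToyModel, score_blocks), the toy architecture, and the exact scoring rule are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBlock(nn.Module):
    """Stand-in for a transformer block (residual MLP, for brevity)."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(x)


class ToyModel(nn.Module):
    """Toy stack of blocks with an optional per-forward block ablation."""
    def __init__(self, vocab=100, d=32, n_blocks=8):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList([ToyBlock(d) for _ in range(n_blocks)])
        self.head = nn.Linear(d, vocab)

    def forward(self, ids, skip=None):
        h = self.emb(ids)
        for i, blk in enumerate(self.blocks):
            if i != skip:  # skip (ablate) one block if requested
                h = blk(h)
        return self.head(h)


@torch.no_grad()
def output_entropy(model, ids, skip=None):
    """Mean entropy of the output token distribution, optionally with one block skipped."""
    logp = F.log_softmax(model(ids, skip=skip), dim=-1)
    return -(logp.exp() * logp).sum(-1).mean().item()


@torch.no_grad()
def score_blocks(model, calib_ids):
    """Importance of block i = |entropy shift of the output distribution when block i is skipped|.
    Blocks with the smallest scores perturb the output distribution least (assumed criterion)."""
    base = output_entropy(model, calib_ids)
    return [abs(output_entropy(model, calib_ids, skip=i) - base)
            for i in range(len(model.blocks))]


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyModel()
    calib = torch.randint(0, 100, (4, 16))  # toy calibration batch of task data
    scores = score_blocks(model, calib)
    keep = sorted(range(len(scores)), key=lambda i: -scores[i])[:6]  # prune 2 blocks
    print("block scores:", [round(s, 4) for s in scores])
    print("blocks kept:", sorted(keep))

In practice one would expect such a score to be computed on task-specific calibration data from a pretrained LLM, but the toy setup above is only meant to make the scoring loop concrete.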
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Interpretability and Analysis of Models for NLP, Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 1261