WaLeM: Layer Pruning via Weight-Aware Learnable Merging for Large Language Model Compression

ACL ARR 2026 January Submission 5049 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: LLM Compression, Layer Pruning, Merging
Abstract: Layer pruning has emerged as a promising technique for compressing Large Language Models (LLMs) by reducing their depth. However, existing methods predominantly rely on direct layer removal or coarse-grained layer merging, where uniform merging coefficients are applied to entire layers. These approaches neglect the distinct functional roles of internal components, often leading to severe feature blurring and performance degradation. To address this, we propose Weight-aware Learnable Merging (WaLeM), a novel framework that transitions from coarse-grained, heuristic layer pruning to fine-grained, optimization-driven layer merging. WaLeM first employs Centered Kernel Alignment (CKA) combined with dynamic programming to globally identify redundant layers for merging, ensuring structural consistency. It then introduces a learnable merging mechanism that assigns adaptive, component-specific coefficients to the Transformer's weight matrices, optimized via knowledge distillation. Extensive experiments on various models demonstrate that WaLeM significantly outperforms state-of-the-art baselines. Notably, WaLeM preserves complex reasoning capabilities on benchmarks such as GSM8K even at high compression rates, offering a superior trade-off between efficiency and performance.
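
The abstract describes two core steps: CKA-based redundancy scoring and component-wise learnable merging. Below is a minimal PyTorch sketch of both, assuming the linear variant of CKA, a sigmoid parameterization of the merging coefficients, and hypothetical component names such as q_proj; it illustrates the general technique and is not the authors' implementation.

import torch
import torch.nn as nn


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature
    y = y - y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = torch.linalg.matrix_norm(y.T @ x) ** 2
    denominator = torch.linalg.matrix_norm(x.T @ x) * torch.linalg.matrix_norm(y.T @ y)
    return numerator / denominator


class LearnableMerge(nn.Module):
    """Merge two layers' weights with one learnable coefficient per component.

    `component_names` is hypothetical (e.g., the projection matrices of a
    Transformer block); the coefficients would be optimized with a
    knowledge-distillation loss while the base weights stay frozen.
    """

    def __init__(self, component_names):
        super().__init__()
        # One logit per weight matrix; sigmoid keeps each coefficient in (0, 1).
        self.logits = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(1)) for name in component_names}
        )

    def merged_weight(self, name: str, w_a: torch.Tensor, w_b: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.logits[name])
        return alpha * w_a + (1.0 - alpha) * w_b


# Example: score redundancy of two layers, then merge one weight matrix.
acts_l5 = torch.randn(256, 4096)  # stand-in activations from layer 5
acts_l6 = torch.randn(256, 4096)  # stand-in activations from layer 6
print(f"CKA(layer5, layer6) = {linear_cka(acts_l5, acts_l6):.4f}")

merger = LearnableMerge(["q_proj", "k_proj", "v_proj", "o_proj"])
w = merger.merged_weight("q_proj", torch.randn(4096, 4096), torch.randn(4096, 4096))

In a full pipeline, pairwise CKA scores over calibration activations would feed the dynamic-programming pass that selects which layers to merge, and the coefficients in LearnableMerge would be trained against a distillation loss (e.g., KL divergence between the original and merged models' logits).
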
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: pruning
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 5049