Training-Free Layer Fusion in Weight Space for Plug-and-Play LLM Compression

ACL ARR 2026 January Submission 3455 Authors

04 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: LLM Compression, Efficient LLM, LLM Streamline
Abstract: Large language models (LLMs) face significant constraints in practical deployment due to their massive size and high inference costs. Existing compression techniques, such as quantization, distillation, and pruning, often suffer from performance degradation, reliance on fine-tuning, or limited hardware support. Recent layer-pruning methods reduce model depth but fail to adequately preserve the functional information of the removed layers. To address these limitations, we propose Layer Fusion (LF), a novel compression framework that fuses the weights of multiple Transformer layers without fine-tuning or extensive data. LF operates in five stages: identifying layer features, selecting fusion targets, extracting residual weights, balancing parameter importance, and generating composite weights. Our method requires only minimal probe data and preserves the original model structure, enabling efficient hardware inference. Experiments show that LF outperforms existing compression approaches across multiple benchmarks and model architectures, achieving a superior performance-size trade-off with minimal computational overhead. The framework is highly scalable and compatible, offering a new direction for efficient model deployment.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation, Interpretability and Analysis of Models for NLP, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Data resources, Theory
Languages Studied: English
Submission Number: 3455