Training-Free Layer Fusion in Weight Space for Plug-and-Play LLM Compression

ACL ARR 2026 January Submission 3455 Authors

04 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: LLM Compression, Efficient LLM, LLM Streamline
Abstract: Large language models (LLMs) face significant constraints in practical deployment due to their massive size and high inference costs. Existing compression techniques, such as quantization, distillation, and pruning, often suffer from performance degradation, reliance on fine-tuning, or limited hardware support. Recent layer-pruning methods reduce model depth but fail to adequately preserve the functional information of the removed layers. To address these limitations, we propose Layer Fusion (LF), a novel compression framework that fuses the weights of multiple Transformer layers without fine-tuning or extensive data. LF operates in five stages: identifying layer features, selecting fusion targets, extracting residual weights, balancing parameter importance, and generating composite weights. Our method requires only minimal probe data and preserves the original model structure, enabling efficient hardware inference. Experiments show that LF outperforms existing compression approaches across multiple benchmarks and model architectures, achieving a superior performance-size trade-off with minimal computational overhead. The framework is highly scalable and compatible, offering a new direction for efficient model deployment.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Generation, Interpretability and Analysis of Models for NLP, Machine Learning for NLP
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Data resources, Theory
Languages Studied: English
Submission Number: 3455