Keywords: LLM Compression, Efficient LLM, LLM Streamline
Abstract: Although large language models (LLMs) have demonstrated remarkable performance in natural language processing tasks, their \textbf{massive parameter counts} and \textbf{high inference costs} severely limit practical applications. Existing compression approaches, such as quantization, knowledge distillation, and pruning, often suffer from significant \textbf{performance degradation}, heavy \textbf{reliance on fine-tuning}, or \textbf{insufficient hardware support}. In recent years, layer pruning has gained attention as a structurally friendly compression strategy. However, existing methods still \textbf{struggle to adequately preserve the functional information} within removed layers and typically \textbf{require complex post-processing}. To address these issues, we propose a novel \ul{\textbf{\atxt{Layer Fusion (LF)}}} framework, which compresses models by fusing functional weights across multiple Transformer layers, requiring \textbf{no fine-tuning} and \textbf{no extensive data}. The LF framework consists of five core modules: identifying layer features, determining fusion targets, extracting residual weights, balancing parameter importance, and generating composite weights through fusion. This approach requires only \textbf{a small amount of probe data} and \textbf{facilitates efficient hardware inference}. Experiments demonstrate that LF significantly \textbf{outperforms mainstream model compression techniques} across multiple benchmarks and model architectures, achieving a superior performance-size trade-off with lower computational overhead. Moreover, LF exhibits \textbf{strong scalability and compatibility}, offering a new direction for model compression research.
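A minimal sketch of what one fusion step might look like, assuming a simple importance-weighted interpolation between two same-shaped weight matrices from adjacent Transformer layers; the function name, the activation-norm scoring rule, and the fusion formula below are illustrative assumptions, not the paper's actual LF procedure.

```python
import torch

def fuse_layer_weights(w_a: torch.Tensor, w_b: torch.Tensor,
                       probe_acts_a: torch.Tensor, probe_acts_b: torch.Tensor) -> torch.Tensor:
    """Fuse two same-shaped weight matrices from adjacent layers into one
    composite matrix, balancing them by importance scores derived from
    probe-data activation magnitudes (hypothetical scoring rule)."""
    # Importance: mean activation magnitude of each layer on a small probe set.
    imp_a = probe_acts_a.abs().mean()
    imp_b = probe_acts_b.abs().mean()
    alpha = imp_a / (imp_a + imp_b)        # balance parameter importance
    residual = w_b - w_a                   # residual weight between the two layers
    return w_a + (1.0 - alpha) * residual  # composite weight replacing both layers

# Toy usage with random tensors standing in for real layer weights / probe activations.
w1, w2 = torch.randn(64, 64), torch.randn(64, 64)
acts1, acts2 = torch.randn(16, 64), torch.randn(16, 64)
fused = fuse_layer_weights(w1, w2, acts1, acts2)
print(fused.shape)  # torch.Size([64, 64])
```

In this reading, the two source layers are replaced by a single layer carrying the composite weight, so the fused model is shallower while retaining a probe-weighted blend of both layers' parameters; the actual LF modules (feature identification, target selection, residual extraction, importance balancing, and fusion) would each refine a step of this sketch.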
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 10201