Abstract: Recent advances in large language model (LLM) compression have predominantly focused on pruning and low-rank factorization, leaving weight sharing—despite its success in classical neural network compression—largely unexplored. We introduce \textsc{LayerDecompose}, a novel framework that reduces parameter redundancy by sharing a core weight matrix across transformer layers and augmenting each layer with lightweight, low-rank adapters. Unlike prior SVD- and pruning-based methods, our joint optimization of shared weights and residual adapters achieves a 30\% model size reduction while retaining 89\% of the original performance on seven standard benchmarks. Experiments on LLaMA-7B and three other 7B-parameter models demonstrate that \textsc{LayerDecompose} consistently outperforms state-of-the-art baselines. These results highlight the promise of combining weight sharing with low-rank adaptation for efficient, scalable LLM deployment.
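To make the core idea concrete, here is a minimal sketch (not the authors' released implementation) of the decomposition the abstract describes: a single core weight matrix shared across transformer layers, with each layer adding a lightweight low-rank residual adapter. All names (SharedLowRankLinear, rank, n_layers) and initialization choices are illustrative assumptions, not details taken from the paper.

# Minimal sketch, assuming a linear sublayer of shape (d_out, d_in) repeated
# across n_layers transformer blocks. Effective per-layer weight:
#   W_l = W_shared + B_l @ A_l   (rank-r residual, r << min(d_in, d_out))
# Parameter count drops from n_layers * d_out * d_in to
# d_out * d_in + n_layers * r * (d_in + d_out).
import math
import torch
import torch.nn as nn

class SharedLowRankLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_layers: int, rank: int = 16):
        super().__init__()
        # Core weight shared by every layer.
        self.w_shared = nn.Parameter(torch.empty(d_out, d_in))
        nn.init.kaiming_uniform_(self.w_shared, a=math.sqrt(5))
        # Per-layer low-rank adapters; B is zero-initialized so each layer
        # starts out identical to the shared core.
        self.a = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(n_layers)]
        )
        self.b = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, rank)) for _ in range(n_layers)]
        )

    def forward(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # Shared projection plus the layer-specific low-rank correction.
        shared = x @ self.w_shared.T
        residual = (x @ self.a[layer_idx].T) @ self.b[layer_idx].T
        return shared + residual

In this sketch, joint optimization simply means training w_shared and all (A_l, B_l) pairs together with the usual language-modeling loss; how the paper schedules or regularizes that optimization is not specified here.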
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: parameter-efficient-training, distillation, scaling
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 6263