Keywords: Post-training Model Compression
Abstract: This paper introduces a new method for the low-rank compression of large language models. Existing techniques typically compress each weight matrix individually, overlooking the dependencies among weights within a transformer block. To address this limitation, we formulate a joint optimization problem that finds optimal low-rank weights for an entire transformer block, minimizing the block's output reconstruction error. Our formulation incorporates key architectural elements, including residual connections and normalization layers. We then introduce SLIM, an efficient algorithm for solving this optimization problem. Experimental results demonstrate that our method consistently improves task accuracy by over 5\% compared to existing techniques across a range of compression ratios and model families.
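A hedged sketch of the block-level objective described in the abstract (the notation here, $f$ for the transformer block, $X$ for calibration inputs, $W_i$ for the block's weight matrices, and $r_i$ for target ranks, is assumed for illustration, not taken from the paper):

$$
\min_{\{\widehat{W}_i\}} \;\left\| f\!\left(X; \{W_i\}\right) - f\!\left(X; \{\widehat{W}_i\}\right) \right\|_F^2
\quad \text{s.t.} \quad \operatorname{rank}\!\left(\widehat{W}_i\right) \le r_i \;\; \forall i,
$$

where $f$ applies the full block, including residual connections and normalization layers, so the low-rank factors are optimized jointly rather than one weight matrix at a time.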
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21471