RoCCo: Rotation-Augmented Clustering-based Low-rank Approximation for Compressing Large Language Models

ICLR 2026 Conference Submission 21329 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large language models, compression, low-rank approximation, clustering
Abstract: The immense size and computational cost of Large Language Models (LLMs) present significant barriers to their widespread deployment. Low-Rank Approximation (LRA) offers a promising, hardware-friendly solution by factorizing large weight matrices into more compact forms. A key insight is that the accuracy of this factorization can be significantly enhanced by first applying a geometric transformation to the model's weights. In this work, we introduce RoCCo (Rotation-augmented Clustering for Compression), a novel LRA framework that uses clustering to factorize weight matrices. We first apply an orthogonal transform to shape restructure the weight geometry to be more suitable to clustering. We then apply a group-wise clustering algorithm to the transformed weights to achieve a precise approximation. Furthermore, we demonstrate that this factorized representation enables a novel clustered attention mechanism, which reduces the algorithmic complexity of inference by performing attention computations directly in the compressed domain. Through experiments on the LLaMA and OPT model families, we show that RoCCo can compress models by 75\% while retaining over 96\% of the original zero-shot accuracy on LLaMA2-13B achieving a competitive compression-accuracy trade-off.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21329