AFORA: Activation-aware Factorization with Optimal Rank Allocation for Training-free LLM Compression

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large language models, Model compression, Low-rank approximation, Activation-aware methods, Rank allocation
TL;DR: We introduce AFORA, a training-free, activation-aware low-rank compression method with optimal rank allocation that enables efficient deployment of large language models.
Abstract: Large language models are challenging to deploy because of their extreme size and compute demands. In this work, we propose AFORA (Activation-aware Factorization with Optimal Rank Allocation for Training-free LLM Compression), a simple and hardware-friendly framework that directly reduces the number of parameters through low-rank factorization of weight matrices. AFORA consists of two core components: (1) Activation-aware Weight Factorization (AWF), a closed-form low-rank approximation that accounts for the input activation distribution to preserve task-relevant directions and ensure numerical stability; and (2) Optimal Rank Allocation (ORA), a global rank allocation strategy that assigns heterogeneous ranks across layers to minimize activation distortion under a given budget. Evaluations across multiple large-scale language models show that AFORA consistently outperforms existing approaches at the same compression ratios while reducing model size, memory use, and computation, with layer dimensions that remain hardware-friendly. Compression itself requires only a short runtime, and the method admits a principled mathematical interpretation. These results demonstrate that activation-aware, globally optimized low-rank compression offers a practical and theoretically grounded path to efficient LLM deployment.
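To make the idea of a closed-form, activation-aware low-rank approximation concrete, the following is a minimal sketch of one well-known way to build such a factorization: whiten the weight matrix with a Cholesky factor of the calibration activation Gram matrix, then truncate an SVD. This is an illustration under stated assumptions and not necessarily the AWF procedure of the paper, whose details are not given in the abstract. The function name `activation_aware_factorize`, the `eps` regularizer, and the synthetic data are all hypothetical.

```python
import numpy as np


def activation_aware_factorize(W, X, rank, eps=1e-6):
    """Illustrative activation-aware low-rank factorization (assumed form).

    W: (out_dim, in_dim) weight matrix.
    X: (in_dim, n_samples) calibration activations.
    Returns A (out_dim, rank) and B (rank, in_dim) with A @ B approximating W
    so that the output error ||W X - A B X||_F is small.
    """
    in_dim = W.shape[1]
    gram = X @ X.T
    # Small ridge term (an assumption of this sketch) keeps the Gram matrix
    # positive definite so the Cholesky factorization is stable.
    gram += eps * np.trace(gram) / in_dim * np.eye(in_dim)
    S = np.linalg.cholesky(gram)  # lower triangular, S @ S.T == gram

    # Since ||(W - W') X||_F equals ||(W - W') S||_F up to the ridge term,
    # the best rank-r W' comes from a truncated SVD of W @ S.
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]
    # B = Vt[:rank] @ inv(S), computed with a linear solve instead of an inverse.
    B = np.linalg.solve(S.T, Vt[:rank].T).T
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 32))
    # Synthetic activations with very uneven per-channel scales.
    X = np.logspace(-1, 1, 32)[:, None] * rng.standard_normal((32, 256))
    A, B = activation_aware_factorize(W, X, rank=4)
    print("activation-space error:", np.linalg.norm(W @ X - A @ B @ X))
```

Storing `A` and `B` in place of `W` replaces `out_dim * in_dim` parameters with `rank * (out_dim + in_dim)`, which is where the parameter reduction comes from. Choosing a different `rank` for each layer under a shared budget is the role the abstract assigns to ORA; that step is not sketched here.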
Primary Area: optimization
Submission Number: 10760