Keywords: Large Language Models, Model Compression, Activation-aware Low-rank Approximation, Alternating Least Squares
TL;DR: ALS-ActLR is a SIMT-guided, activation-aware ALS and uncertainty-weighted distillation pipeline that compresses LLMs by 40–80% under the LRA-then-update paradigm, with minimal performance loss using only a tiny (256-sample) calibration set.
Abstract: Large language models (LLMs) achieve state-of-the-art performance but remain impractical for on-device deployment due to memory and compute constraints, making compression essential. Activation-aware low-rank approximation is promising, yet existing methods follow a two-step \emph{approximate-then-factorize} routine, which couples the factors and weakens preservation of salient activation structure. We present \emph{ALS-ActLR}, which combines a spectral-informed metric transformation (SIMT) with \emph{Activation-aware ALS} to optimize the low-rank factors directly using a tiny calibration set (256 samples). A subsequent uncertainty-weighted distillation stage further recovers lost information by adaptively balancing cross-entropy, knowledge distillation, and feature-alignment losses. Experiments show that \emph{ALS-ActLR} substantially reduces parameters and FLOPs while preserving accuracy and perplexity, consistently outperforming strong baselines. Concretely, on Llama-7B at 60\% compression (i.e., 60\% of parameters removed, 40\% retained), it reduces mean perplexity to 27.74 (a 69.0\% reduction relative to the best baseline) and raises accuracy to 48.92\% (+3.72 points), and it achieves the best scores across 40--80\% compression ratios and model scales from 1.1B to 13B spanning multiple families. These results highlight \emph{ALS-ActLR} as a scalable and effective framework for activation-aware compression.
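For intuition, the sketch below shows one way an activation-aware alternating-least-squares factorization of a single weight matrix can be set up: the low-rank factors are fit to preserve layer outputs on a small calibration batch rather than the raw weights. It is a minimal sketch, not the paper's exact ALS-ActLR procedure (which additionally applies the SIMT transformation and the distillation stage); the function name, SVD initialization, ridge term, fixed iteration count, and toy dimensions are all illustrative assumptions.

```python
import numpy as np

def activation_aware_als(W, X, rank, n_iters=20, ridge=1e-6):
    """Illustrative activation-aware ALS (not the paper's exact procedure).

    Approximates a weight matrix W (d_out x d_in) by a rank-r product U @ V
    (U: d_out x r, V: r x d_in) so that the layer *outputs* on calibration
    activations X (n x d_in) are preserved, i.e. it minimizes
        || X @ W.T - X @ (U @ V).T ||_F
    rather than the plain weight error || W - U @ V ||_F.
    """
    d_out, d_in = W.shape
    # Gram matrix of the calibration activations; a small ridge keeps the
    # least-squares subproblems well-posed when the calibration set is tiny.
    G = X.T @ X + ridge * np.eye(d_in)

    # Initialize V from a truncated SVD of W (one common, simple choice).
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:rank]                                   # r x d_in

    for _ in range(n_iters):
        # Fix V, solve the activation-weighted least squares for U:
        #   (V G V^T) U^T = V G W^T
        VG = V @ G                                  # r x d_in
        U = np.linalg.solve(VG @ V.T, VG @ W.T).T   # d_out x r
        # Fix U, solve for V; the Gram matrix cancels when it is full rank:
        #   (U^T U) V = U^T W
        V = np.linalg.solve(U.T @ U, U.T @ W)       # r x d_in
    return U, V

# Toy usage: compress a 512x512 layer to rank 64 with 256 calibration samples.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
X = rng.standard_normal((256, 512))
U, V = activation_aware_als(W, X, rank=64)
print(np.linalg.norm(X @ W.T - X @ (U @ V).T) / np.linalg.norm(X @ W.T))
```

Storing U and V in place of W reduces the layer's parameters from d_out*d_in to r*(d_out + d_in), so the per-layer rank r is what sets the overall 40--80\% compression ratio.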
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7695