Towards a Unified Theory of Quantization and Sparsity

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: model compression, quantization, sparsity
Abstract: Quantization and sparsification are two model compression strategies that are traditionally treated as orthogonal in the literature. Building on recent work, we show that considering them jointly can meaningfully affect compression performance. First, we extend prior tensor-level analyses and prove that, for any $L_p$ norm, applying sparsification before quantization ($\mathbf{S} \to \mathbf{Q}$) always yields lower error than the reverse ordering. However, we demonstrate that tensor-level analysis is insufficient to predict model performance, motivating model-level evaluation. We therefore provide the first model-level analysis showing that $\mathbf{S} \to \mathbf{Q}$ achieves better loss in certain settings when the quantization and sparsification algorithms are chosen independently. Yet this preference has its limits: once model assumptions are fully relaxed, proving the superiority of $\mathbf{S} \to \mathbf{Q}$ becomes difficult, casting doubt on the preference in the general case. To illustrate this, we introduce Quantization-Aware Sparsification (QAS), a novel compression framework that sparsifies while accounting for prior quantization, and use it to construct a simple counterexample in which $\mathbf{Q} \to \mathbf{S}$ with QAS performs comparably to $\mathbf{S} \to \mathbf{Q}$, showing that careful co-design between compression steps can greatly influence performance.
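The ordering question at the heart of the abstract can be made concrete with a minimal sketch. The snippet below (not taken from the paper) compares tensor-level $L_2$ error for $\mathbf{S} \to \mathbf{Q}$ versus $\mathbf{Q} \to \mathbf{S}$ on a random weight matrix; the magnitude-based sparsifier and uniform symmetric quantizer are illustrative assumptions, not the authors' algorithms, and QAS itself is not implemented here.

```python
# Illustrative sketch: compare compression orderings S->Q and Q->S at the tensor level.
# Assumptions (not from the paper): top-k magnitude sparsification, uniform symmetric quantization.
import numpy as np

rng = np.random.default_rng(0)

def sparsify(x, keep_ratio=0.5):
    """Keep the largest-magnitude entries; zero out the rest."""
    k = int(np.ceil(keep_ratio * x.size))
    threshold = np.sort(np.abs(x).ravel())[-k]
    return np.where(np.abs(x) >= threshold, x, 0.0)

def quantize(x, bits=4):
    """Uniform symmetric quantization to 2**bits levels."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    if scale == 0:
        return x
    return np.round(x / scale) * scale

w = rng.normal(size=(512, 512))

s_then_q = quantize(sparsify(w))   # S -> Q: sparsify first, then quantize
q_then_s = sparsify(quantize(w))   # Q -> S: quantize first, then sparsify

for name, w_hat in [("S->Q", s_then_q), ("Q->S", q_then_s)]:
    err = np.linalg.norm(w - w_hat)  # L2 reconstruction error against the original tensor
    print(f"{name}: L2 error = {err:.4f}")
```

Under these simple choices the two orderings can be compared directly; the paper's claim is that such tensor-level comparisons alone do not determine model-level loss, which is why the model-level analysis and the QAS counterexample are needed.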
Supplementary Material: zip
Primary Area: optimization
Submission Number: 23743