SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions

TMLR Paper7725 Authors

02 Mar 2026 (modified: 31 May 2026), Under review for TMLR, CC BY 4.0
Abstract: Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods apply weight pruning or low-bit quantization in isolation, and often must settle for suboptimal compression rates to keep the performance drop acceptable. We introduce SQS, a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning, which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. Because the objective involving spike-and-slab priors with GMMs is intractable, we derive an efficient approximation that facilitates effective compression with minimal accuracy loss. On the theoretical side, we establish a consistency result for the proposed variational approach with respect to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a range of existing methods with comparable performance drops. Code for SQS and the baselines is available at: https://anonymous.4open.science/r/SQS_private-411C.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Enzo_Tartaglione1
Submission Number: 7725