Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks

Alper KALLE; Théo Rudkiewicz; Mohamed Ouerfelli; Mohamed Tamaazousti

Published: 18 Sept 2025, Last Modified: 29 Oct 2025. Venue: NeurIPS 2025 poster. License: CC BY 4.0

Keywords: Compression of neural networks, tensor decomposition, low-rank approximation, distribution-aware norm

TL;DR: Distribution-aware ALS algorithms for Tucker-2 and CP decompositions compress CNN weights by directly minimizing output-distribution shift, delivering competitive, fine-tuning-free accuracy that even transfers across datasets.

Abstract: Neural networks are widely used for image-related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory and compute footprint can be reduced by compression. In this work, we focus on compression through tensorization and low-rank representations. Whereas classical approaches search for a low-rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight space, we use data-informed norms that measure the error in function space. Concretely, we minimize the change in the layer's output distribution, which can be expressed as $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F$, where $\Sigma^{1/2}$ is the square root of the covariance matrix of the layer's input and $W$ and $\widetilde{W}$ are the original and compressed weights, respectively. We propose new alternating least squares (ALS) algorithms for the two most common tensor decompositions, Tucker-2 and CP decomposition (CPD), that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post-compression fine-tuning, our data-informed approach often achieves competitive accuracy without any fine-tuning. We further show that the same covariance-based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable. Experiments on several CNN architectures (ResNet-18/50 and GoogLeNet) and datasets (ImageNet, FGVC-Aircraft, CIFAR-10, and CIFAR-100) confirm the advantages of the proposed method.
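To make the distribution-aware norm concrete, the sketch below checks numerically that $\lVert (W - \widetilde{W}) \Sigma^{1/2}\rVert_F^2 = \operatorname{tr}\big((W - \widetilde{W})\,\Sigma\,(W - \widetilde{W})^\top\big)$ equals the average squared change in the layer output over the input samples. It is an illustration only, not the paper's ALS algorithm. It assumes a plain linear layer rather than a convolution, takes $\Sigma$ to be the uncentered second-moment matrix of the inputs (which coincides with the covariance for zero-mean inputs), and uses rounding as a hypothetical stand-in for a compressed weight matrix. All function and variable names are illustrative.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def second_moment(X):
    """Sigma = (1/n) * sum_i x_i x_i^T for samples stored as rows of X."""
    n = len(X)
    return [[v / n for v in row] for row in matmul(transpose(X), X)]

def weighted_sq_norm(D, Sigma):
    """||D Sigma^{1/2}||_F^2, computed as trace(D Sigma D^T) so no matrix square root is needed."""
    DS = matmul(D, Sigma)
    return sum(ds * d for ds_row, d_row in zip(DS, D) for ds, d in zip(ds_row, d_row))

def mean_sq_output_change(D, X):
    """(1/n) * sum_i ||D x_i||^2, the average squared change in the layer output."""
    out = matmul(X, transpose(D))  # row i holds (D x_i)^T
    return sum(v * v for row in out for v in row) / len(X)

random.seed(0)
n, d_in, d_out = 200, 4, 3
# Hypothetical anisotropic inputs: the first feature has far larger variance than the others.
X = [[random.gauss(0, 5.0)] + [random.gauss(0, 0.1) for _ in range(d_in - 1)]
     for _ in range(n)]
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Stand-in for a compressed layer (an assumption; the paper uses Tucker-2 / CPD instead).
W_tilde = [[round(w, 1) for w in row] for row in W]
D = [[w - wt for w, wt in zip(r, rt)] for r, rt in zip(W, W_tilde)]

Sigma = second_moment(X)
weighted = weighted_sq_norm(D, Sigma)        # distribution-aware squared error
empirical = mean_sq_output_change(D, X)      # measured squared output change
frobenius = sum(v * v for row in D for v in row)  # isotropic weight-space squared error

print(f"weighted  = {weighted:.6g}")
print(f"empirical = {empirical:.6g}")
print(f"frobenius = {frobenius:.6g}")
```

The first two printed values agree up to floating-point rounding, whereas the plain Frobenius error ignores how strongly each input direction is actually excited by the data.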

Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)

Submission Number: 21939
