Convex Distillation: Efficient Compression of Deep Networks via Convex Optimization

26 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | CC BY 4.0
Keywords: Convex Neural Networks, Convex/Non-Convex Optimization, Knowledge Distillation, Model Compression, Label-Free Training, Classification
TL;DR: We introduce a convex neural network distillation method that compresses large pre-trained deep networks without requiring labeled data or fine-tuning, combining convex and non-convex model architectures for efficient deployment on edge devices.
Abstract:

Deploying large, complex deep neural networks on resource-constrained edge devices poses significant challenges due to their computational demands and the difficulties of non-convex optimization. Traditional compression methods such as distillation and pruning often retain non-convexity, which complicates real-time fine-tuning on such devices. Moreover, these methods typically require extensive end-to-end fine-tuning after compression to preserve model performance, which is not only time-consuming but also demands fully annotated datasets, potentially negating the benefits of efficient network compression. In this paper, we introduce a novel distillation technique that efficiently compresses the model via convex optimization -- eliminating intermediate non-convex activation functions and using only intermediate activations from the original model. Our approach enables distillation in a label-free setting and achieves performance comparable to the original model without any post-compression fine-tuning. We demonstrate the effectiveness of our method for image classification models on multiple standard datasets, and further show that in the data-limited regime, our method can outperform standard non-convex distillation approaches. Our method promises significant advantages for deploying high-efficiency, low-footprint models on edge devices, making it a practical choice for real-world applications. We show that convex neural networks, when provided with rich feature representations from a large pre-trained non-convex model, can achieve performance comparable to their non-convex counterparts, opening avenues for future research at the intersection of convex optimization and deep learning.
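
To make the idea in the abstract concrete, the sketch below illustrates one way a label-free, convex distillation step could look: a frozen non-convex teacher supplies intermediate activations and soft targets, and a linear (hence convex-trainable) student head is fit to match the teacher's logits with a squared-error loss, using no ground-truth labels. This is only an illustrative approximation under stated assumptions (ResNet-18 as a stand-in teacher, a single linear student layer, least-squares matching), not the paper's actual architecture or objective.

```python
# Minimal sketch of label-free distillation with a convex objective.
# Assumptions (not from the paper): ResNet-18 teacher, linear student head,
# squared-error matching of teacher logits -- convex in the student's weights.
import torch
import torch.nn as nn
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen, pre-trained non-convex teacher.
teacher = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).to(device).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Intermediate activations: all layers up to and including global pooling.
feature_extractor = nn.Sequential(*list(teacher.children())[:-1])  # (N, 512, 1, 1)

# Convex student: a single linear map from pooled features to class logits.
student = nn.Linear(512, 1000).to(device)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

def distill_step(images):
    """One label-free step: fit the student to the teacher's logits."""
    images = images.to(device)
    with torch.no_grad():
        feats = feature_extractor(images).flatten(1)  # teacher's intermediate activations
        target_logits = teacher(images)               # soft targets; no labels needed
    loss = mse(student(feats), target_logits)         # convex in student's parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with an unlabeled data loader (labels, if present, are ignored):
# for images, _ in unlabeled_loader:
#     distill_step(images)
```

Because the objective is convex in the student's parameters, this step could equally be solved in closed form (e.g., ridge regression on cached activations) rather than by gradient descent; the loop above is shown only for familiarity.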

Supplementary Material: zip
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5852
