Keywords: Edge computing, compression, efficient inference, distillation and inference, run-time tradeoff, inference-time tradeoff, on-device, user-side, client-side
TL;DR: A method to distill a large pretrained model into a series of smaller ones that can be used to trade off accuracy for latency at runtime/inference time.
Abstract: Knowledge distillation is commonly used to compress an ensemble of models into a single model. In this work we study the problem of progressive ensemble distillation: given a large, pretrained teacher model, we seek to decompose the model into an ensemble of smaller, low-inference-cost student models. The resulting ensemble allows for flexibly tuning accuracy vs. inference cost, which can be useful for a multitude of applications in efficient inference. Our method, B-DISTIL, uses a boosting procedure that allows function-composition-based aggregation rules to construct expressive ensembles with performance comparable to the teacher while using much smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across a variety of image, speech, and sensor datasets. Our method comes with strong theoretical guarantees in terms of both convergence and generalization.
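To make the setting concrete, below is a minimal, hypothetical sketch of the general progressive-ensemble-distillation pattern the abstract describes, written in PyTorch. The class and function names (`SmallStudent`, `distill_round`, `anytime_predict`), the simple residual-fitting loss, and the budget-based inference loop are illustrative assumptions, not B-DISTIL's actual boosting procedure or aggregation rules.

```python
# Hypothetical sketch: distill a teacher into a sequence of small students,
# then trade accuracy for latency at inference time by evaluating only as
# many students as the latency budget allows.
import torch
import torch.nn as nn

class SmallStudent(nn.Module):
    """A deliberately small student; the ensemble sums student outputs."""
    def __init__(self, in_dim, num_classes, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, x):
        return self.net(x)

def distill_round(teacher, ensemble, data, in_dim, num_classes, epochs=5):
    """Boosting-style round: fit a new student to the teacher's logits minus
    what the current ensemble already predicts (the residual)."""
    student = SmallStudent(in_dim, num_classes)
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, _ in data:
            with torch.no_grad():
                target = teacher(x)
                for s in ensemble:  # subtract the current ensemble's prediction
                    target = target - s(x)
            loss = nn.functional.mse_loss(student(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

def anytime_predict(ensemble, x, budget):
    """Evaluate students in order, stopping once the latency budget (number of
    students we can afford to run) is exhausted; more budget, better accuracy."""
    num_classes = ensemble[0].net[-1].out_features
    logits = torch.zeros(x.shape[0], num_classes)
    for s in ensemble[:budget]:
        logits = logits + s(x)
    return logits.argmax(dim=-1)
```

In this sketch, a deployment on a constrained device would call `anytime_predict` with whatever `budget` the current latency or energy headroom permits, which is one simple way to realize the accuracy-vs-inference-cost tradeoff the abstract refers to.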
Submission Number: 13698