Keywords: Fisher information, knowledge distillation, network quantization, loss landscape
TL;DR: We study two ways of making models robust to quantization: Fisher regularization and knowledge distillation. We show that quantized networks trained with distillation generalize better because the distillation temperature inversely scales the curvature of the loss surface.
Abstract: A large body of work addresses deep neural network (DNN) quantization and pruning to mitigate the high computational burden of deploying DNNs. We analyze two prominent classes of methods: the first uses regularization based on the Fisher Information Matrix (FIM) of the parameters, whereas the second uses a student-teacher paradigm, referred to as Knowledge Distillation (KD). The Fisher criterion can be interpreted as regularizing the network by penalizing an approximation of the KL-divergence (KLD) between the outputs of the original and quantized models. The KD approach bypasses the need to estimate the FIM and directly minimizes the KLD between the two models. We place these two approaches in a unified setting and study their generalization characteristics through their loss landscapes. On the CIFAR-10 and CIFAR-100 datasets, we show that at higher temperatures distillation produces wider minima in the loss landscape and yields higher accuracy than the Fisher criterion.
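To make the KD objective described above concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the temperature-scaled KL-divergence loss between a teacher (e.g. full-precision) model and a student (e.g. quantized) model; the function names, logit values, and default temperature are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_kl_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on temperature-softened outputs.

    The T**2 factor is the standard rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, T)  # teacher (hypothetical full-precision model)
    q = softmax(student_logits, T)  # student (hypothetical quantized model)
    return T**2 * np.sum(p * (np.log(p) - np.log(q)))

# Identical outputs give zero divergence:
print(kd_kl_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

Raising T softens both distributions, which (per the TL;DR) flattens the loss surface the student sees and encourages convergence to wider minima.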