- Keywords: Bayesian Neural Networks, Distillation
- TL;DR: A general framework for distilling Bayesian posterior expectations for deep neural networks.
- Abstract: In this paper, we present a general framework for distilling expectations with respect to the Bayesian posterior distribution of a deep neural network, significantly extending prior work on a method known as ``Bayesian Dark Knowledge." Our generalized framework applies to the case of classification models and takes as input the architecture of a ``teacher" network, a general posterior expectation of interest, and the architecture of a ``student" network. The distillation method performs an online compression of the selected posterior expectation using iteratively generated Monte Carlo samples from the parameter posterior of the teacher model. We further consider the problem of optimizing the student model architecture with respect to an accuracy-speed-storage trade-off. We present experimental results investigating multiple data sets, distillation targets, teacher model architectures, and approaches to searching for student model architectures. We establish the key result that distilling into a student model with an architecture that matches the teacher, as is done in Bayesian Dark Knowledge, can lead to sub-optimal performance. Lastly, we show that student architecture search methods can identify student models with significantly improved performance.