Keywords: Knowledge distillation, Neural Network (NN) Feature learning, ensembling NNs, Deep learning fundamentals, Image classification
Abstract: Trained Neural Networks (NNs) can be viewed as data-dependent kernel machines, with predictions determined by the inner product of last-layer representations across inputs, referred to as the feature kernel. We explore the relevance of the feature kernel for Knowledge Distillation (KD), using a mechanistic understanding of an NN’s optimisation process. We extend the theoretical analysis of Allen-Zhu & Li (2020) to show that a trained NN’s feature kernel is highly dependent on its parameter initialisation, which biases different initialisations of the same architecture to learn different data attributes in a multi-view data setting. This enables us to prove that KD using only pairwise feature kernel comparisons can improve NN test accuracy in such settings, with both single & ensemble teacher models, whereas standard training without KD fails to generalise. We further use our theory to motivate practical considerations for improving student generalisation when using distillation with feature kernels, which allows us to propose a novel approach: Feature Kernel Distillation (FKD). Finally, we experimentally corroborate our theory in the image classification setting, showing that FKD is amenable to ensemble distillation, can transfer knowledge across datasets, and outperforms both vanilla KD & other feature kernel based KD baselines across a range of standard architectures & datasets.
One-sentence Summary: A feature-learning perspective of (ensemble) Knowledge Distillation (KD) in Neural Networks to propose a new method (FKD), with both theoretical & experimental results demonstrating FKD's advantages over standard KD baselines.