- Abstract: Deep neural networks have achieved state-of-the-art performance in various fields, but they have to be scaled down to be used for real-world applications. As a means to reduce the size of a neural network while preserving its performance, knowledge transfer has brought a lot of attention. One popular method of knowledge transfer is knowledge distillation (KD), where softened outputs of a pre-trained teacher network help train student networks. Since KD, other transfer methods have been proposed, and they mainly focus on loss functions, activations of hidden layers, or additional modules to transfer knowledge well from teacher networks to student networks. In this work, we focus on the structure of a teacher network to get the effect of multiple teacher networks without additional resources. We propose changing the structure of a teacher network to have stochastic blocks and skip connections. In doing so, a teacher network becomes the aggregate of a huge number of paths. In the training phase, each sub-network is generated by dropping stochastic blocks randomly and used as a teacher network. This allows training the student network with multiple teacher networks and further enhances the student network on the same resources in a single teacher network. We verify that the proposed structure brings further improvement to student networks on benchmark datasets.
- Keywords: deep learning, knowledge transfer, model compression
- TL;DR: The goal of this paper is to get the effect of multiple teacher networks by exploiting stochastic blocks and skip connections.