Layerwise Bregman Representation Learning of Neural Networks with Applications to Knowledge Distillation

Published: 20 Feb 2023, Last Modified: 28 Feb 2023. Accepted by TMLR.
Abstract: We propose a new method for layerwise representation learning of a trained neural network that conforms to the non-linearity of the layer's transfer function. In particular, we form a Bregman divergence based on the convex function induced by the layer's transfer function and construct an extension of the original Bregman PCA formulation by incorporating a mean vector and revising the normalization constraint on the principal directions. These modifications allow exporting the learned representation as a fixed layer with a non-linearity. As an application to knowledge distillation, we cast the learning problem for the student network as predicting the compression coefficients of the teacher's representations, which are then passed as the input to the imported layer. Our empirical findings indicate that our approach is substantially more effective for transferring information between networks than typical teacher-student training that uses the teacher's soft labels.
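As a minimal illustration of the construction described in the abstract (not the paper's implementation): for a strictly monotonic transfer function f, one can take the convex potential F whose gradient is f and form the Bregman divergence D_F(a, b) = F(a) - F(b) - ⟨∇F(b), a - b⟩. The sketch below assumes a sigmoid transfer function, for which the induced potential is the softplus; the function names are illustrative.

```python
import numpy as np

def softplus(x):
    # Convex potential F with F'(x) = sigmoid(x); logaddexp is numerically stable.
    return np.logaddexp(0.0, x)

def sigmoid(x):
    # Transfer function f = grad F.
    return 1.0 / (1.0 + np.exp(-x))

def bregman_divergence(a, b):
    # D_F(a, b) = F(a) - F(b) - <grad F(b), a - b>, summed elementwise.
    return np.sum(softplus(a) - softplus(b) - sigmoid(b) * (a - b))
```

By construction D_F(a, b) is non-negative and vanishes only when a = b, and it is generally asymmetric in its arguments, which is what lets the divergence conform to the layer's non-linearity rather than defaulting to squared Euclidean distance.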
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have tried to address as many concerns as possible. The changes are marked in red in the updated draft. In particular:
- We present our work as a layerwise representation learning method for deep neural networks with strictly monotonic transfer functions. Specifically, we emphasize the importance of representation learning and knowledge transfer in the introduction. We then motivate our work as a way of exporting compressed representations from trained neural networks, with an application to knowledge distillation. We have rearranged the sections and moved some of the technical details to the appendix.
- We have included additional references and related work.
- We have added the number of parameters and the runtime for our large-scale ImageNet-1k experiments.
- Finally, we have included an additional section in the appendix on the matching loss of a transfer function.

Post acceptance: We have fixed several typos, added additional references, removed duplicates, and made the notation more consistent.
Assigned Action Editor: ~Stephan_M_Mandt1
Submission Number: 459