Layer-wise Knowledge Distillation from a Pretrained Network Improves Hypernetwork Convergence

Prabhash Kumarasinghe; Bernd Meyer; Anuja Dharmaratne

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Hypernetwork, knowledge distillation, hypernet convergence, classification

TL;DR: Layer-wise knowledge distillation improves the convergence of hypernetworks that use Magnitude Invariant Parameterisations (MIP), with the AT and AB methods enabling deep architectures such as VGG19 to approach the performance of canonical training.

Abstract: Hypernetworks that generate the weights of another network often exhibit lower test accuracy and slower convergence due to implicit weight updates. The recently proposed HyperLight framework (Magnitude Invariant Parameterisations, MIP) addresses this convergence issue by bounding the scale of the hypernetwork's input encoding using sine-cosine transforms and by introducing additive weights. Preliminary experiments revealed that when deeper primary networks are fully hypernetised, MIP achieves lower test accuracy than a canonically trained network. This paper investigates layer-wise knowledge distillation (KD) methods for hypernetwork training, bridging the hypernetised layers with a pretrained Teacher network of the same architecture. Nine layer-wise KD methods (Feature-KD), namely AB, AT, CwD, FitNets, FSP, FT, JacobianKD, RKD, and SP, were evaluated on the shufflenetv2_0x5 architecture for the CIFAR-100 classification task. The two best-performing methods, AB-KD (Activation Boundary) and AT-KD (Attention Transfer), were further evaluated on nine additional deep networks, including ShuffleNet, ResNet, MobileNet, VGG, and Reparameterised VGG. Experiments reveal that the AT and AB methods applied to MIP hypernetworks improve performance even for fully hypernetised deeper networks such as VGG19. For example, AB-KD with MIP achieved a test accuracy of 72.65%, only 1.22 percentage points lower than the canonically trained Teacher at 73.87%, compared to the MIP baseline accuracy of 11%.
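
To illustrate what a layer-wise (feature) KD term can look like, the following is a minimal sketch of an attention-transfer loss for one layer and one sample, in the commonly used formulation: the channel-wise mean of squared activations is flattened, L2-normalised, and compared between Student and Teacher by mean squared difference. Whether this matches the exact AT-KD variant evaluated in the paper is an assumption; the function names and the nested-list `[C][H][W]` layout are hypothetical choices made for readability, and a practical implementation would use a tensor library and sum the term over all bridged layers.

```python
import math


def attention_map(features):
    """Collapse a [C][H][W] activation volume into a unit-norm spatial attention vector."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    # Mean of squared activations over the channel axis, flattened to H*W values.
    flat = [
        sum(features[c][i][j] ** 2 for c in range(channels)) / channels
        for i in range(height)
        for j in range(width)
    ]
    # L2-normalise so the map is insensitive to the overall activation scale.
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0  # guard against an all-zero map
    return [v / norm for v in flat]


def attention_transfer_loss(student_features, teacher_features):
    """Mean squared difference between the Student and Teacher attention maps."""
    s = attention_map(student_features)
    t = attention_map(teacher_features)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


# Hypothetical 1-channel, 2x2 activations for a single layer.
student = [[[1.0, 0.0], [0.0, 0.0]]]
teacher = [[[0.0, 1.0], [0.0, 0.0]]]
print(attention_transfer_loss(student, teacher))  # 0.5
```

Because the maps are normalised, the loss depends only on where the activations concentrate spatially, not on their magnitude; in this sketch a Teacher whose activations are a scaled copy of the Student's gives a loss of approximately zero.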

Primary Area: transfer learning, meta learning, and lifelong learning

Submission Number: 10820
