Scalable Mean-Field Variational Inference through Deterministic Teacher-Guided Knowledge Distillation
Abstract: Scaling mean-field variational inference (MFVI) to large deep networks remains challenging: in high dimensions, Gaussian posteriors concentrate probability mass on thin hyperspherical shells far from the mean (the ``soap-bubble'' phenomenon), so sampled weights produce high-variance gradient estimates that destabilize ELBO training. On ImageNet, vanilla MFVI achieves only 21.57\% top-1 accuracy with a ResNet-18 backbone, indicating severe optimization degradation relative to its deterministic counterpart. We introduce a decoupled training framework that addresses this issue at the level of the optimization objective. A pretrained teacher network supervises the mean parameters $\mu$ through an output-space knowledge distillation term, while the variance parameters $\rho$ are optimized exclusively through the standard ELBO. The combined objective admits a hybrid MAP-VI interpretation in which the distillation term defines a teacher-induced pseudo-prior on $\mu$, recovering standard MFVI as a special case. Under standard local regularity conditions, we prove a dimension-independent confinement bound showing that the optimum lies within a controllable neighborhood of the teacher's effective solution. Empirically, the method reaches 71.58\% top-1 accuracy and an expected calibration error (ECE) of 0.0251 on ImageNet with ResNet-18, and it is complementary to existing techniques such as MOPED and Radial reparameterization.
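The ``soap-bubble'' effect mentioned in the abstract can be illustrated with a minimal sketch. It assumes an isotropic Gaussian $N(0, \sigma^2 I_d)$ rather than a trained posterior, and the helper names `sample_norms` and `shell_stats` are hypothetical, introduced only for this illustration. The sketch draws samples and compares their Euclidean norms with $\sigma\sqrt{d}$: as $d$ grows, the norms cluster ever more tightly around that radius, so samples almost never land near the mean.

```python
import math
import random


def sample_norms(dim, sigma=1.0, n_samples=200, seed=0):
    """Euclidean norms of n_samples draws from N(0, sigma^2 I_dim)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    norms = []
    for _ in range(n_samples):
        squared = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dim))
        norms.append(math.sqrt(squared))
    return norms


def shell_stats(dim, sigma=1.0, n_samples=200, seed=0):
    """Return (mean norm / (sigma * sqrt(dim)), std of norm / mean norm)."""
    norms = sample_norms(dim, sigma, n_samples, seed)
    mean = sum(norms) / len(norms)
    var = sum((r - mean) ** 2 for r in norms) / len(norms)
    return mean / (sigma * math.sqrt(dim)), math.sqrt(var) / mean


if __name__ == "__main__":
    # Illustrative dimensions only; they are not taken from the paper.
    for dim in (2, 100, 5000):
        ratio, rel_spread = shell_stats(dim)
        print(f"dim={dim:5d}  mean_norm/(sigma*sqrt(dim))={ratio:.3f}  "
              f"relative_spread={rel_spread:.3f}")
```

In low dimension the relative spread of the norm is large, so samples near the mean are common. In high dimension the ratio approaches 1 and the relative spread shrinks roughly like $1/\sqrt{2d}$, which is the thin-shell concentration the abstract identifies as the source of high-variance gradient estimates.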
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Oleg_Arenz1
Submission Number: 8845