On Difficulties of Probability Distillation

Chin-Wei Huang; Faruk Ahmed; Kundan Kumar; Alexandre Lacoste; Aaron Courville

27 Sept 2018 (modified: 05 May 2023) | ICLR 2019 Conference Blind Submission | Readers: Everyone

Abstract: Probability distillation has recently been of interest to deep learning practitioners because it offers a practical way to sample from autoregressive models when deploying them in real-time applications. We identify a pathological optimization issue with the commonly adopted stochastic minimization of the (reverse) KL divergence: the gradient signal from the teacher model is sparse because of the curse of dimensionality. We also explore alternative principles for distillation and show that one can achieve qualitatively better results than with KL minimization.

Keywords: probability distillation, autoregressive models, normalizing flows, WaveNet, PixelCNN

TL;DR: We point out an optimization issue in distillation with the KL divergence and explore alternatives.
