On the Relationship Between Neural Tangent Kernel Frobenius Distance and Distillation Sample Complexity
Keywords: Model Distillation, Neural Tangent Kernel, Centered Kernel Alignment, Sample Complexity, Language Models
TL;DR: This paper links distillation difficulty to the distance between teacher and student Neural Tangent Kernels (NTKs), proposing Centered Kernel Alignment (CKA) as a practical proxy to predict it.
Abstract: Knowledge distillation is a popular method for compressing large neural networks, from large language models to computer vision models, into smaller, more efficient models. However, predicting how effective distillation will be for a given teacher-student pair without incurring expensive training costs remains a significant challenge. The same question arises when designing models intended to resist distillation, a common concern for developers seeking to protect their intellectual property. To address this, we propose a theoretical framework that connects the properties of a teacher model to the inherent difficulty of distillation. Our work centers on the conjecture that, under Neural Tangent Kernel (NTK) assumptions, this difficulty is lower bounded by the Frobenius distance between the teacher and student kernel matrices. We then propose Centered Kernel Alignment (CKA) as a computable proxy for this conjectured bound, based on the heuristic assumption that representation similarity reflects the similarity of the models' learning dynamics. This framework offers mathematical tools for estimating the feasibility of distillation prior to experimentation.
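As a minimal sketch of the CKA proxy mentioned in the abstract (not the paper's implementation), the snippet below computes linear-kernel CKA between teacher and student Gram matrices built from their feature representations on a shared batch of inputs; the feature matrices, sizes, and function names are illustrative placeholders.

```python
import numpy as np

def center_gram(K):
    """Center an n x n Gram (kernel) matrix: K <- H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(K_teacher, K_student):
    """Centered Kernel Alignment between two n x n kernel matrices (values in [0, 1])."""
    Kc, Lc = center_gram(K_teacher), center_gram(K_student)
    hsic = np.sum(Kc * Lc)  # Frobenius inner product <Kc, Lc>_F
    norm = np.linalg.norm(Kc) * np.linalg.norm(Lc)
    return hsic / norm

# Placeholder teacher/student feature matrices (n examples x d features),
# standing in for representations extracted on a shared probe batch.
rng = np.random.default_rng(0)
feats_teacher = rng.normal(size=(128, 512))
feats_student = rng.normal(size=(128, 64))

# Linear kernels (Gram matrices) over the same batch of inputs.
K_t = feats_teacher @ feats_teacher.T
K_s = feats_student @ feats_student.T

print(f"CKA(teacher, student) = {cka(K_t, K_s):.3f}")
```

Under the paper's heuristic, a lower CKA between the two kernels would indicate a larger kernel distance and hence a harder distillation problem; the same routine could be applied to empirical NTK matrices where computing them is feasible.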
Submission Number: 53