- Keywords: Generalization, statistical learning theory, theory, distillation
- Abstract: This paper provides a suite of mathematical tools to bound the generalization error of networks that possess low-complexity distillations, that is, networks for which there exist simple networks whose softmax outputs approximately match those of the original network. The primary contribution is this bound, which upper bounds the test error of a network by the sum of its training error, the distillation error, and the complexity of the distilled network. Supporting this, secondary contributions include: a generalization bound that can handle convolutions and skip connections, a generalization analysis of the compression step leading to a bound with small width- and depth-dependence via weight matrix stable ranks, and a sampling theorem to sparsify dense networks. The bounds and their behavior are illustrated empirically on the standard MNIST and CIFAR datasets.
- One-sentence Summary: This paper provides a suite of mathematical tools to bound the generalization error of networks that possess low-complexity distillations.
- Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
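
The abstract states that the compression analysis depends on width and depth through weight matrix stable ranks. As a minimal sketch, assuming the standard definition of stable rank (squared Frobenius norm divided by squared spectral norm), the quantity can be computed for a single weight matrix as below. The function name `stable_rank` is hypothetical and is not taken from the paper's code; how the paper combines per-layer stable ranks into its bound is not shown here.

```python
import numpy as np


def stable_rank(weight: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 of a 2-D weight matrix.

    Assumption: this is the standard definition; the helper itself is
    illustrative and not part of the paper's released code.
    """
    if weight.ndim != 2:
        raise ValueError("expected a 2-D weight matrix")
    # Squared Frobenius norm: sum of squared entries.
    fro_sq = np.linalg.norm(weight, ord="fro") ** 2
    # Squared spectral norm: largest singular value, squared.
    spec_sq = np.linalg.norm(weight, ord=2) ** 2
    if spec_sq == 0.0:
        raise ValueError("stable rank is undefined for the zero matrix")
    return float(fro_sq / spec_sq)
```

The stable rank is at most the ordinary rank and at least 1 for any nonzero matrix, so a layer whose spectrum is dominated by a few singular values has a small stable rank even when it is wide.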