Keywords: kernel methods, deep learning theory, convolution, approximation, generalization
Abstract: The empirical success of deep convolutional networks on tasks involving high-dimensional data such as images or audio suggests that they can efficiently approximate certain functions that are well suited to such tasks. In this paper, we study this question through the lens of kernel methods, by considering simple hierarchical kernels with two or three convolution and pooling layers, inspired by convolutional kernel networks. These kernels achieve good empirical performance on standard vision datasets, while admitting a precise description of their function space that yields new insights on their inductive bias. We show that the RKHS consists of additive models of interaction terms between patches, and that its norm encourages spatial similarities between these terms through pooling layers. We then provide generalization bounds which illustrate how pooling and patches yield improved sample complexity guarantees when the target function exhibits such regularities.
One-sentence Summary: We study the inductive bias of multi-layer convolutional models through a kernel lens, showing generalization benefits of various architectural choices such as locality, depth, and pooling layers.
Supplementary Material: zip