Keywords: contrastive learning, kernel methods, Markov chain, eigenfunction, spectral decomposition, representation learning
TL;DR: Contrastive learning methods approximate a positive-definite kernel, and kernel PCA on that kernel extracts an optimal representation for learning functions that take similar values on positive pairs.
Abstract: Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning a positive-definite kernel that approximates a particular *contrastive kernel* defined by the positive pairs. The principal components of the data under this kernel exactly correspond to the eigenfunctions of a positive-pair Markov chain, and these eigenfunctions can be used to build a representation that provably minimizes the worst-case approximation error of linear predictors under the assumption that positive pairs have similar labels. We give generalization bounds for downstream linear prediction using this optimal representation, and show how to approximate this representation using kernel PCA. We also explore kernel-based representations on a noisy MNIST task for which the positive-pair distribution has a closed form, and compare the properties of the true eigenfunctions with their learned approximations.
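
The correspondence between kernel principal components and Markov chain eigenfunctions can be checked numerically on a small finite example. The sketch below is illustrative only and rests on assumptions not stated in the abstract: views live in a finite set, positive pairs are two independent augmentations of a shared latent source, and the contrastive kernel is taken to be the density ratio `K(a, a') = p_pos(a, a') / (p(a) p(a'))`. All names (`P_pos`, `A`, `p_z`, `n_src`, `n_views`) are hypothetical, and the kernel PCA here is the uncentered population version, weighted by the marginal `p`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_views = 5, 8  # hypothetical sizes for the toy problem

# Assumed generative process: source z ~ p_z, then two views drawn i.i.d. from A[z].
p_z = rng.dirichlet(np.ones(n_src))
A = rng.dirichlet(np.ones(n_views), size=n_src)  # A[z, a] = p(a | z)

# Joint distribution over positive pairs: P_pos[a, a'] = sum_z p(z) A[z, a] A[z, a'].
P_pos = (A * p_z[:, None]).T @ A
p = P_pos.sum(axis=1)  # marginal distribution over views

# Positive-pair Markov chain: T[a, a'] = p(a' | a).
T = P_pos / p[:, None]

# Assumed contrastive kernel: density ratio p_pos(a, a') / (p(a) p(a')).
K = P_pos / np.outer(p, p)

# Population (uncentered) kernel PCA under p: eigendecompose D^{1/2} K D^{1/2}
# with D = diag(p), then rescale eigenvectors into functions on views.
sqrt_p = np.sqrt(p)
S = sqrt_p[:, None] * K * sqrt_p[None, :]
evals, U = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]  # sort eigenvalues in descending order
evals, U = evals[order], U[:, order]
F = U / sqrt_p[:, None]  # column k is the k-th principal function, orthonormal under p

# Each kernel principal function is an eigenfunction of the Markov chain
# with the same eigenvalue: T f_k = lambda_k f_k.
print(np.allclose(T @ F, F * evals))
```

In this toy setting the leading eigenvalue is 1 with a constant eigenfunction, and the remaining eigenvalues lie in [0, 1]; a low-dimensional representation would keep the columns of `F` with the largest eigenvalues.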