Keywords: representation learning, learnability, information bottleneck
TL;DR: Theory predicts the phase transition between unlearnable and learnable values of beta for the Information Bottleneck objective
Abstract: Compressed representations generalize better (Shamir et al., 2010), which may be crucial when learning from limited or noisy labeled data. The Information Bottleneck (IB) method (Tishby et al. (2000)) provides an insightful and principled approach for balancing compression and prediction in representation learning. The IB objective I(X; Z) − βI(Y ; Z) employs a Lagrange multiplier β to tune this trade-off. However, there is little theoretical guidance for how to select β. There is also a lack of theoretical understanding about the relationship between β, the dataset, model capacity, and learnability. In this work, we show that if β is improperly chosen, learning cannot happen: the trivial representation P(Z|X) = P(Z) becomes the global minimum of the IB objective. We show how this can be avoided, by identifying a sharp phase transition between the unlearnable and the learnable which arises as β varies. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, providing theoretical guidance for selecting β. We further show that IB-learnability is determined by the largest confident, typical, and imbalanced subset of the training examples. We give a practical algorithm to estimate the minimum β for a given dataset. We test our theoretical results on synthetic datasets, MNIST, and CIFAR10 with noisy labels, and make the surprising observation that accuracy may be non-monotonic in β.