Keywords: representation learning, contrastive learning, clustering, dimensionality reduction
Abstract: The choice of optimization objective fundamentally determines the success of representation learning methods, yet the field has converged on a single statistical divergence without systematically exploring alternatives. The Information Contrastive (I-Con) framework recently revealed that over 23 diverse representation learning methods all implicitly minimize a KL divergence between a data-derived distribution and a learned distribution, each encoding similarities between data points.
This reliance on KL divergence may yield suboptimal representations. KL divergence is known to cause optimization difficulties owing to properties such as its asymmetry and its ability to take infinite values. Moreover, because loss functions are only proxies for the true goal of the task, optimizing KL divergence may be misaligned with that objective.
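To make the concern concrete (the notation here is ours, a minimal illustration rather than the paper's exact formulation): for discrete distributions $p$ and $q$ over the neighbors of a data point,
$$\mathrm{KL}(p \,\|\, q) = \sum_j p_j \log \frac{p_j}{q_j},$$
which is asymmetric ($\mathrm{KL}(p\,\|\,q) \neq \mathrm{KL}(q\,\|\,p)$ in general) and becomes infinite whenever some $q_j = 0$ while $p_j > 0$.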
We present Beyond I-Con, which makes the following contributions: (1) we generalize I-Con by replacing KL divergence with alternative f-divergences, revealing that KL is not unique in enabling meaningful feature optimization; (2) we systematically explore combinations of f-divergences and similarity kernels, uncovering novel loss functions with superior performance on unsupervised clustering, supervised contrastive learning, and dimensionality reduction. In particular, we demonstrate that total variation (TV) distance, when paired with distance-based similarity kernels, outperforms KL-based approaches.
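As a minimal sketch of this generalization (our notation, assuming the I-Con form in which each data point $i$ induces a supervisory neighborhood distribution $p(\cdot \mid i)$ and a learned distribution $q_\phi(\cdot \mid i)$), the objective becomes
$$\mathcal{L}(\phi) = \frac{1}{n} \sum_{i=1}^{n} D_f\!\left(p(\cdot \mid i)\,\|\,q_\phi(\cdot \mid i)\right), \qquad D_f(p\,\|\,q) = \sum_j q_j\, f\!\left(\frac{p_j}{q_j}\right),$$
where $f$ is convex with $f(1) = 0$. The choice $f(t) = t \log t$ recovers KL, while $f(t) = \tfrac{1}{2}\,|t - 1|$ yields the total variation distance $\mathrm{TV}(p, q) = \tfrac{1}{2}\sum_j |p_j - q_j|$, which is symmetric and bounded in $[0, 1]$.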
We achieve the following results: (1) on unsupervised clustering of DINO-ViT embeddings, we obtain state-of-the-art results by modifying the PMI algorithm to use TV distance; (2) on supervised contrastive learning, we outperform the standard approach by using TV with a distance-based similarity kernel instead of KL with an angular kernel; (3) on dimensionality reduction, we achieve superior qualitative results and better downstream-task performance than SNE by replacing KL with a bounded f-divergence. Our results highlight the importance of the choice of divergence and similarity kernel in representation learning optimization.
Submission Number: 104