Keywords: contrastive learning, representation learning, multimodal alignment
TL;DR: This paper introduces a unified and theoretically grounded perspective for contrastive alignment through RKHS, providing closed-form solutions via a contrastive similarity weight matrix.
Abstract: Contrastive objectives power state-of-the-art multimodal models, but training them remains slow, relying on lengthy stochastic optimization.
We propose a Unified Framework for Efficient Contrastive Alignment via Kernels (UniCon), which spans linear and nonlinear encoders as well as one-to-one and many-to-many alignments.
At its core, UniCon introduces the contrastive similarity weight matrix $S(\gamma)$, which enables closed-form global solutions that provably replace minibatch back-propagation with exact updates. Through the lens of reproducing kernel Hilbert spaces (RKHS), UniCon provides a kernelized perspective that unifies contrastive alignment and reveals its connection to spectral methods.
To validate the theory, we conduct experiments on synthetic, unimodal, multimodal, and zero-shot tasks, demonstrating that UniCon achieves substantial efficiency gains while preserving generality and strong empirical performance.
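As a rough illustration only, the following is a minimal sketch of the general idea of replacing minibatch back-propagation with an exact, closed-form alignment step. It uses a simple linear one-to-one alignment solved by an SVD of the cross-covariance; the paper's contrastive similarity weight matrix $S(\gamma)$, its kernelized (RKHS) formulation, and the actual UniCon updates are not specified in this abstract and are not reproduced here.

```python
# Hypothetical sketch, NOT the paper's UniCon algorithm: closed-form linear
# alignment of two paired modalities via SVD of the cross-covariance,
# illustrating how exact updates can replace stochastic optimization.
import numpy as np

def closed_form_alignment(X, Y, dim):
    """Fit rank-`dim` linear encoders Wx, Wy aligning X (n x dx) and Y (n x dy).

    Paired rows are positives; the projections maximize the total
    cross-similarity of positive pairs under orthonormality constraints,
    which has an exact SVD solution (no minibatch gradient steps).
    """
    Xc = X - X.mean(axis=0)            # center each modality
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / len(X)             # cross-covariance, shape (dx, dy)
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :dim], Vt[:dim].T      # Wx (dx x dim), Wy (dy x dim)

# Toy usage: two noisy views of a shared latent signal.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 4))                                   # shared latent
X = Z @ rng.normal(size=(4, 32)) + 0.1 * rng.normal(size=(500, 32))
Y = Z @ rng.normal(size=(4, 48)) + 0.1 * rng.normal(size=(500, 48))
Wx, Wy = closed_form_alignment(X, Y, dim=4)
sim = (X @ Wx) @ (Y @ Wy).T                                     # cross-modal similarities
print("top-1 retrieval accuracy:", (sim.argmax(axis=1) == np.arange(500)).mean())
```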
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 1806