Keywords: Spectral Clustering, Vision-Language Models, Neural Tangent Kernel
Abstract: Spectral clustering is a powerful technique in unsupervised data analysis.
The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped.
Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime.
In particular, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models.
By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we formulate the affinity between images as a coupling of their visual proximity and semantic overlap.
We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures.
In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts.
Extensive experiments on 16 benchmarks, including classical, large-scale, fine-grained and domain-shifted datasets, show that our method consistently outperforms the state of the art by a large margin.
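The coupling of visual proximity and semantic overlap described in the abstract can be illustrated with a toy sketch. This is not the paper's neural tangent kernel formulation: as a simplifying assumption, it stands in plain cosine similarity for visual proximity and the inner product of softmax distributions over noun embeddings for semantic overlap. The function names, the toy feature vectors and the `temperature` value are all hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature):
    # Numerically stable softmax with a temperature.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def coupled_affinity(image_feats, noun_feats, temperature=0.1):
    # Assumption: visual proximity = non-negative cosine similarity,
    # semantic overlap = inner product of per-image noun distributions,
    # coupling = elementwise product of the two matrices.
    n = len(image_feats)
    noun_probs = [
        softmax([cosine(x, t) for t in noun_feats], temperature)
        for x in image_feats
    ]
    visual = [[max(0.0, cosine(image_feats[i], image_feats[j]))
               for j in range(n)] for i in range(n)]
    semantic = [[sum(p * q for p, q in zip(noun_probs[i], noun_probs[j]))
                 for j in range(n)] for i in range(n)]
    coupled = [[visual[i][j] * semantic[i][j]
                for j in range(n)] for i in range(n)]
    return visual, semantic, coupled

# Hypothetical toy embeddings: images 0-1 form one group, images 2-3 another.
images = [[1.0, 0.2], [0.9, 0.3], [0.3, 0.9], [0.2, 1.0]]
nouns = [[1.0, 0.0], [0.0, 1.0]]  # one anchor noun per group

visual, semantic, coupled = coupled_affinity(images, nouns)
for row in coupled:
    print(["%.3f" % a for a in row])
```

In this toy setting the visual-only affinity between images from different groups remains sizeable, while the coupled affinity pushes it close to zero and leaves within-group entries nearly unchanged, which is the block-diagonal tendency the abstract refers to.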
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 1386