Abstract: Contrastive Learning (CL) aims to create effective embeddings for input data by minimizing the distance between positive pairs, i.e., different augmentations or views of the same sample. To avoid degeneracy, CL also employs auxiliary loss to maximize the discrepancy between negative pairs formed with views of distinct samples. As a self-supervised learning strategy, CL inherently attempts to cluster input data into natural groups. However, the often improper trade-off between the attractive and repulsive forces, respectively induced by positive and negative pairs, can lead to deformed clustering, particularly when the number of clusters $k$ is unknown. To address this, we propose NRCC, a CL-based deep clustering framework that generates cluster-friendly embeddings. NRCC repurposes Stochastic Gradient Hamiltonian Monte Carlo sampling as an approximately invariant data augmentation, to curate hard negative pairs that judiciously enhance and balance the two adversarial forces through a regularizer. By preserving the cluster structure in the CL embedding, NRCC retains local density landscapes in lower dimensions through neighborhood-conserving projections. This enables the application of mode-seeking clustering algorithms, typically hindered by high-dimensional CL feature spaces, to achieve exceptional accuracy without needing a predetermined $k$. NRCC's superiority is demonstrated across various datasets with different scales and cluster structures, outperforming 20 state-of-the-art methods.
Submission Length: Long submission (more than 12 pages of main content)
Code: https://github.com/abhisheka456/NRCC
Assigned Action Editor: ~Sungwoong_Kim2
Submission Number: 2766
Loading