Keywords: Informative Diversification, Consensus Clustering, Multi-View Embeddings, Gaussian Mixture Models, Contrastive Learning
TL;DR: We show that many views are better than one: by diversifying embeddings and co-training with consensus, we obtain sharper, provably more accurate clusters that generalize to unseen text.
Abstract: Clustering text into coherent groups is a long-standing challenge, complicated by high-dimensional embeddings, semantic ambiguity, and distributional shifts in unseen data. Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) systems have further underscored the need for robust and scalable knowledge representation methods. In this work, we introduce a novel clustering framework based on informative diversification. Our method applies a set of semantic-preserving transformations to generate multiple views of the data, and then harnesses their collective structure through a spectral consensus process. We prove that, provided the views are diverse and informative, consensus clustering achieves an exponentially lower expected error rate than any single view. We then propose an iterative co-training procedure that learns a cluster-friendly latent space by jointly minimizing a contrastive InfoNCE loss and a Gaussian mixture negative log-likelihood. This training sharpens cluster assignments and pulls embeddings toward their centroids, while dynamically updating the assignments as the latent space evolves. The result is a robust and generalizable model that not only outperforms baselines on benchmark datasets but also maintains strong accuracy on unseen text, making it a powerful tool for real-world knowledge discovery and retrieval-augmented generation systems.
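A minimal sketch of the joint objective described in the abstract, assuming a PyTorch setup: an InfoNCE contrastive term over two augmented views plus a diagonal-covariance Gaussian mixture negative log-likelihood in the shared latent space. Function names, shapes, and the view-averaging choice are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): InfoNCE + GMM NLL joint loss.
import math
import torch
import torch.nn.functional as F


def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss treating matching rows of z1 and z2 as positive pairs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)


def gmm_nll(z, means, log_vars, log_weights):
    """Negative log-likelihood of embeddings z under a diagonal GMM.

    z: (B, D); means, log_vars: (K, D); log_weights: (K,), assumed log-normalized.
    """
    diff = z.unsqueeze(1) - means.unsqueeze(0)             # (B, K, D)
    log_prob = -0.5 * ((diff ** 2) / log_vars.exp()
                       + log_vars
                       + math.log(2 * math.pi)).sum(-1)    # (B, K) per-component log N
    log_mix = torch.logsumexp(log_prob + log_weights, 1)   # (B,) log p(z)
    return -log_mix.mean()


def joint_loss(z1, z2, means, log_vars, log_weights, lam=1.0):
    """Joint objective: contrastive alignment of views + GMM fit of the latent space."""
    z = 0.5 * (z1 + z2)  # assumed: averaged views feed the mixture term
    return info_nce(z1, z2) + lam * gmm_nll(z, means, log_vars, log_weights)
```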
Primary Area: probabilistic methods (Bayesian methods, variational inference, sampling, UQ, etc.)
Submission Number: 4162