Keywords: Constrained Clustering, Mixed-Integer Optimization
TL;DR: We introduce SDC-GBB, a deterministic branch-and-bound algorithm for pairwise-constrained k-means that guarantees global optimality on cannot-link-only and mixed-constraint datasets of up to 200K samples and on must-link-only datasets of up to 1.5M samples.
Abstract: Constrained clustering leverages limited domain knowledge to improve clustering performance and interpretability, but incorporating pairwise must-link and cannot-link constraints is NP-hard, which makes global optimization intractable. Existing mixed-integer optimization methods are confined to small-scale datasets, which limits their utility. We propose Sample-Driven Constrained Group-Based Branch-and-Bound (SDC-GBB), a decomposable branch-and-bound (BB) framework that collapses must-linked samples into centroid-based pseudo-samples and prunes cannot-link constraints through geometric rules, while preserving convergence and guaranteeing global optimality. The algorithm integrates grouped-sample Lagrangian decomposition and geometric elimination rules to compute lower and upper bounds efficiently, and it scales through straightforward, embarrassingly parallel computation. Experimental results show that our approach handles datasets with 200,000 samples under cannot-link constraints and 1,500,000 samples under must-link constraints, which is 200 to 1,500 times larger than the current state of the art under comparable constraint settings, while reaching an optimality gap of at most 3%. By providing deterministic global guarantees, our method also avoids the search failures that off-the-shelf heuristics often encounter on large datasets.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14782
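
The abstract's step of collapsing must-linked samples into centroid-based pseudo-samples can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `collapse_must_link`, its signature, and the union-find grouping are assumptions made for illustration, and cannot-link constraints are not handled here. It relies on the standard identity that, for a group that must share one cluster center, the sum of squared distances to that center equals the group size times the squared distance from the group centroid to the center, plus a constant within-group scatter term.

```python
import numpy as np

def collapse_must_link(X, must_link):
    """Merge must-linked samples into weighted centroid pseudo-samples.

    Hypothetical helper for illustration only.
    X: (n, d) array of samples; must_link: iterable of index pairs (a, b).
    Returns (centroids, weights, group_ids, offset), where offset is the
    constant within-group scatter that no choice of centers can change.
    """
    n = X.shape[0]
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Must-link is transitive, so take connected components of the pairs.
    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    roots = np.array([find(i) for i in range(n)])
    _, group_ids = np.unique(roots, return_inverse=True)
    n_groups = group_ids.max() + 1

    # Group sizes act as weights of the pseudo-samples.
    weights = np.bincount(group_ids, minlength=n_groups).astype(float)
    sums = np.zeros((n_groups, X.shape[1]))
    np.add.at(sums, group_ids, X)  # unbuffered per-group accumulation
    centroids = sums / weights[:, None]

    # Scatter of each sample around its own group centroid.
    offset = float(((X - centroids[group_ids]) ** 2).sum())
    return centroids, weights, group_ids, offset
```

Under these assumptions, a weighted k-means objective over `centroids` with `weights`, plus `offset`, matches the original objective for any assignment that respects the must-link constraints, so the search can run on the smaller set of pseudo-samples.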