Relative Error Fair Clustering in the Weak-Strong Oracle Model

Published: 01 May 2025, Last Modified: 23 Jul 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We give the first $(1+\varepsilon)$-coreset for fair $k$-median in the weak-strong oracle model.
Abstract: We study fair clustering problems in a setting where distance information is obtained from two sources: a strong oracle providing exact distances, but at a high cost, and a weak oracle providing potentially inaccurate distance estimates at a low cost. The goal is to produce a near-optimal fair clustering on $n$ input points with a minimum number of strong oracle queries. This models the increasingly common trade-off between accurate but expensive similarity measures (e.g., large-scale embeddings) and cheaper but inaccurate alternatives. The study of fair clustering in this model is motivated by the important quest of achieving fairness in the presence of inaccurate information. We achieve the first $(1+\varepsilon)$-coresets for fair $k$-median clustering using $\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$ queries to the strong oracle. Furthermore, our results imply coresets for the standard setting (without fairness constraints), and we in fact obtain $(1+\varepsilon)$-coresets for $(k,z)$-clustering for general $z=O(1)$ with a similar number of strong oracle queries. In contrast, previous results achieved a constant-factor ($>10$) approximation for the standard $k$-clustering problems, and no previous work considered the fair $k$-median clustering problem.
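For readers unfamiliar with the terminology, the following is the commonly used textbook form of the coreset guarantee in the standard setting (without fairness constraints). The symbols $P$, $S$, $w$, $C$, and $d$ are notation assumed here for illustration; the paper's own definitions, in particular for the fair variant with group constraints, may differ in detail. For a point set $P$ in a metric space with distance $d$ and a set $C$ of $k$ centers, the $(k,z)$-clustering cost is

$$\mathrm{cost}_z(P, C) = \sum_{p \in P} \min_{c \in C} d(p, c)^z,$$

where $z = 1$ corresponds to $k$-median. A weighted subset $S$ with weights $w$ is then usually called a $(1+\varepsilon)$-coreset if, for every center set $C$ of size $k$,

$$(1-\varepsilon)\,\mathrm{cost}_z(P, C) \;\le\; \sum_{s \in S} w(s)\,\min_{c \in C} d(s, c)^z \;\le\; (1+\varepsilon)\,\mathrm{cost}_z(P, C).$$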
Lay Summary: Clustering, the task of grouping similar data points together, is central to machine learning. It is widely used in real-world applications, like recommending movies, summarizing customer data, or grouping job applicants. But clustering can go wrong in two key ways: it can be unfair (for example, placing people of one gender or race disproportionately in certain clusters), and it can be expensive when relying on accurate but costly computations. This paper tackles both problems at once. The authors study how to cluster data fairly while using a mix of cheap, potentially inaccurate information (from a weak source) and expensive, reliable information (from a strong source). Think of weak information as a quick estimate and strong information as the expensive but accurate output from a powerful model. Their main contribution is an efficient method to select a small, weighted summary of the data (called a coreset) that can be clustered instead of the full dataset. Crucially, this coreset supports fair clustering (ensuring balanced representation of subgroups like gender or race), and it can be built using only a small number of expensive strong queries. This means fairness and accuracy can be achieved without a huge computational cost. In short, this work helps make fairness in AI more affordable and practical, especially in settings where resources are limited and data is messy or unreliable.
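As a rough illustration of the setting described above, the sketch below models a weak and a strong distance oracle and counts strong queries while evaluating a weighted $k$-median cost. It is a toy under stated assumptions, not the paper's algorithm: the class `WeakStrongOracle`, the uniform-noise weak oracle, and the hand-picked weighted summary are all hypothetical, and the summary here is not produced by any coreset construction.

```python
import random


class WeakStrongOracle:
    """Toy distance oracles over points on the real line (hypothetical interface)."""

    def __init__(self, points, noise=0.5, seed=0):
        self.points = points
        self.noise = noise              # assumed noise model, for illustration only
        self.rng = random.Random(seed)
        self.strong_queries = 0         # number of expensive queries made so far

    def strong(self, i, j):
        """Exact distance between points i and j; each call is counted."""
        self.strong_queries += 1
        return abs(self.points[i] - self.points[j])

    def weak(self, i, j):
        """Cheap, possibly inaccurate estimate; not counted."""
        true = abs(self.points[i] - self.points[j])
        return max(0.0, true + self.rng.uniform(-self.noise, self.noise))


def weighted_k_median_cost(oracle, summary, centers):
    """Sum of weight * distance to nearest center, using strong queries.

    summary: list of (point index, weight); centers: list of point indices.
    """
    return sum(w * min(oracle.strong(i, c) for c in centers) for i, w in summary)


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
centers = [1, 4]

# Full dataset: every point has weight 1.
full_oracle = WeakStrongOracle(points)
full = [(i, 1.0) for i in range(len(points))]
full_cost = weighted_k_median_cost(full_oracle, full, centers)

# Hand-picked weighted summary (NOT a real coreset construction).
small_oracle = WeakStrongOracle(points)
summary = [(0, 2.0), (3, 2.0)]
summary_cost = weighted_k_median_cost(small_oracle, summary, centers)

print(full_cost, full_oracle.strong_queries)
print(summary_cost, small_oracle.strong_queries)
```

The summary happens to match the full cost for this one choice of centers while using a third of the strong queries; an actual coreset must approximate the cost for every choice of centers, which is what makes its construction nontrivial.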
Link To Code: https://drive.google.com/drive/folders/1DLuPMpNe01JHB6pM1a-ZeCY478KNdpAC?usp=sharing
Primary Area: Social Aspects->Fairness
Keywords: clustering, weak-strong oracle model, fairness
Submission Number: 9189