EL-Clustering: Combining Upper- and Lower-Bounded Clusterings for Equitable Load Constraints

Published: 07 Oct 2025, Last Modified: 07 Oct 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: An ordinary clustering algorithm may produce a clustering in which the number of points per cluster (the cluster size) varies significantly. In settings where the centers correspond to facilities that provide a service, this can be highly undesirable, as the cluster size is essentially the service load of a facility. While prior work has considered imposing either a lower bound or an upper bound on the cluster sizes, imposing both bounds simultaneously has received limited attention, especially for the $k$-median objective, despite its strong practical motivation. In this paper, we solve the \emph{equitable load} (\EL{}) clustering problem, where we minimize the $k$-median objective subject to the cluster sizes neither exceeding an upper bound nor falling below a lower bound. We solve this problem using a modular approach. Specifically, given a clustering solution that satisfies the lower bound constraints and another that satisfies the upper bound constraints, we introduce a combination algorithm that merges the two solutions into one that satisfies both constraints simultaneously, at the expense of a bounded degradation in the $k$-median objective and a slight violation of the upper bound. Our combination algorithm runs in $O(k^3+n)$ time, where $n$ is the number of points, which is faster than standard $k$-median algorithms that satisfy either the lower or the upper bound constraints. Interestingly, our results generalize to various other clustering objectives, including the $k$-means objective. We also evaluate the approach empirically for the $k$-median objective on benchmark datasets, showing that both the cost and the violation factor are significantly smaller in practice than the theoretical worst-case guarantees\footnote{https://github.com/0-rudra-0/el-clustering}.
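To make the quantities in the abstract concrete, the following is a minimal sketch (not the authors' code) of how one might evaluate a given clustering against equitable load constraints. The function name `evaluate_clustering`, the uniform bounds `lower` and `upper`, and the choice to measure the upper bound violation as the largest load divided by `upper` are all assumptions made for illustration; the paper's own definition of "violation factor" appears in its Section 2 and may differ.

```python
from math import dist


def evaluate_clustering(points, centers, assignment, lower, upper):
    """Evaluate an assignment of points to centers (hypothetical helper).

    points:     list of coordinate tuples
    centers:    list of coordinate tuples (all assumed open)
    assignment: assignment[i] is the index of the center serving points[i]
    lower:      assumed uniform lower bound on every cluster size
    upper:      assumed uniform upper bound on every cluster size
    """
    loads = [0] * len(centers)
    cost = 0.0
    for p, c in zip(points, assignment):
        # k-median objective: sum of distances from points to their centers
        cost += dist(p, centers[c])
        loads[c] += 1

    return {
        "cost": cost,
        "loads": loads,
        # True when every cluster meets the lower bound
        "lower_ok": all(load >= lower for load in loads),
        # Assumed measure: 1.0 means the upper bound is respected
        "upper_violation": max(1.0, max(loads) / upper),
    }


# Toy instance: five points, two centers, every load should lie in [2, 2].
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
centers = [(0, 0), (10, 10)]
assignment = [0, 0, 0, 1, 1]
result = evaluate_clustering(points, centers, assignment, lower=2, upper=2)
print(result)
```

In this toy instance the first cluster holds three points against an upper bound of two, so the solution meets the lower bound but exceeds the upper bound by a factor of 1.5 under the assumed measure. This is the kind of bounded violation the abstract refers to.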
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

Revised rebuttal to Reviewer 3 on "empirical evaluation and the impact of upper bound violations and large approximation factors":
- We have added a new "Experiments" section (Appendix C) with an empirical evaluation on benchmark datasets (Adult, Diabetes, Bank).
- We have incorporated results showing that, in practice, the cost and violation factors are much smaller than the theoretical guarantees.
- All source code, logs, and charts are publicly available at the GitHub link provided in the revised paper.

The previous rebuttal, submitted on August 18, 2025, is as follows.

Reviewer 1:
- Question on non-uniform bounds: As we briefly mention in the current conclusion, our core lemmas and the main theorem hold if one of the bounds is uniform across all facilities (e.g., facility-specific lower bounds $L_i$ with a uniform upper bound $U$, or vice versa). However, extending our proof to the fully general case, where both $L_i$ and $U_i$ are facility-specific, is a more complex challenge that we believe is an interesting direction for future research.
- Results table: We added a summary table (Appendix B) with the approximation ratios and violation factors obtained by plugging in state-of-the-art $S_U$ and $S_L$ routines, with citations.

Reviewer 2:
- Terms for non-experts: We added explicit definitions in Section 2 for “approximation factor” and “violation factor”.
- Question on LP: The standard LP relaxation for EL clustering has an unbounded integrality gap even if one of the bounds is allowed to be violated. However, strengthening techniques may be used to explore a possible solution.
- Section 5 readability: We inserted a short Definitions paragraph (star, star-center, spokes, open/closed), removed the confusing $y_\ell$ reference in prose, and added a toy example.
- Runtime: After Theorem 3.1, we state that the reported $O(k^3+n)$ time is only for the combination step that merges the given $S_U$ and $S_L$; the time to compute $S_U$ and $S_L$ depends on the chosen UkM/LkM methods.
- Title: Changed to “EL-Clustering: Combining Upper- and Lower-Bounded Clusterings for Equitable Load Constraints.”
- Appendix citations: Appendix B now includes references for the $S_U$ and $S_L$ algorithms in a summary table.

Reviewer 3:
- Early definitions: We inserted a short Definitions paragraph in Section 5.
- Post-Corollary 3.2 rewrite: We have substantially reorganized the text following Corollary 3.2. The new version presents, in separate paragraphs: (i) a concise summary of the approximation guarantee, (ii) a discussion of how the combination technique works irrespective of the technique used to solve the underlying problems, (iii) a brief discussion of running time, and (iv) extensions to other objectives and a special case. Redundant phrases have been removed.
- Major Section 5 revision: Section 5 has been rewritten to focus on an intuitive explanation of the algorithm. The revised section contains a high-level description of our algorithm’s overall strategy with clearly labeled paragraphs (e.g., Grouping into Stars, Processing Stars, and Processing Order via Dependency Graph). We have also added a simple toy example that explains the processing of a star, to address the concern of another reviewer.
- Reduced repetition and citation cleanup: The discussion of the modular approach now first appears in the introduction (before the organization of the paper). We have removed the repeated explanation in Section 3, paragraph 1: “We present a modular technique that combines the solutions of the lower bounded variant and the upper bounded variant of the problem to obtain our result stated in Theorem 3.1.” In the second-to-last paragraph of Section 3, the word “modular” has likewise been removed to reduce redundancy.
- Empirical evaluation and the impact of upper bound violations and large approximation factors: Our paper is primarily theory-focused: the main contribution is in developing algorithms with provable guarantees. We would ideally like the work to stand on these theoretical results alone. That said, if the committee considers experiments essential, we are willing to include a small-scale evaluation on synthetic instances to illustrate the practical viability of our approach, though we would need some additional time to prepare this. Regarding the violation and approximation factors, we note that these are worst-case bounds and can be smaller in practice.
- Repeated citations: These were a formatting issue, which we have fixed in various places with changes similar to the following example. Section 1, paragraph 2, third line: "Chhabra et al. Chhabra et al. (2020) and Cinà et al. Cinà et al. (2022)" -> "Chhabra et al. (2020) and Cinà et al. (2022)".
- Language and consistency pass: We have carefully edited the entire manuscript for grammar, tone, punctuation, hyphenation, and consistent notation. Informal expressions have been replaced with precise technical language.

Details of the changes can be found in the official comment.
Code: https://github.com/0-rudra-0/el-clustering
Assigned Action Editor: ~Sivan_Sabato1
Submission Number: 5220