['1c1', '< Title: Random Cuts are Optimal for Explainable k-Medians', '---', '> Title: Optimal Explainable k-Medians via Random Coordinate Cuts', '3c3', '< Abstract: We show that the RANDOMCOORDINATECUT algorithm gives the optimal competitive ratio for explainable k-medians in ℓ₁. The problem of explainable k-medians was introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian in 2020. Several groups of authors independently proposed a simple polynomial-time randomized algorithm for the problem and showed that this algorithm is O(log k log log k)-competitive. We provide a tight analysis of the algorithm and prove that its competitive ratio is upper bounded by 2 ln k + 2. This bound matches the Ω(log k) lower bound by Dasgupta et al. (2020). Note that the running time of this algorithm is Õ(kd). Gamlath, Jia, Polak, and Svensson [2021] provided a slightly worse bound of O(log² k) on the competitive ratio of this algorithm. They also', '---', '> Abstract: Explainable k-medians clustering is a fundamental problem in unsupervised learning with increasing importance in critical applications. The RANDOMCOORDINATECUT algorithm, a simple polynomial-time randomized approach, has been independently proposed by several groups for this problem. While previous analyses established an O(log k log log k) competitive ratio, its optimality remained a conjecture. In this paper, we provide a tight analysis of the RANDOMCOORDINATECUT algorithm, proving that its competitive ratio is at most 2 ln k + 2. This bound matches the Ω(log k) lower bound established by Dasgupta et al. (2020) up to a constant factor, resolving an open question and improving upon prior analyses, including a slightly worse O(log² k) bound by Gamlath, Jia, Polak, and Svensson (2021). 
Our work demonstrates that random coordinate cuts achieve an asymptotically optimal competitive ratio for explainable k-medians in the ℓ₁ norm.', '6,23c6', "< Machine learning is being increasingly used to make decisions for critical applications, such as healthcare, finance, and public policy. Considering the profound impact of algorithmic decisions on individuals and society, it is essential to understand the underlying logic behind these decisions. In this paper, we explore an explainable k-medians clustering algorithm (called RANDOMCOORDINATECUT). The algorithm's aim is to cluster data sets and present results in a manner easily understood and visualized by humans.", '< Clustering is a fundamental task in unsupervised learning. Among many clustering methods, k-means, k-medians, and k-medoids are particularly popular. These are centroid-based methods that choose k centers and assign each data point to the center nearest to it. As a result, each cluster is a Voronoi cell in the Voronoi partition of the space. Since these cells may have a complicated boundary (see Figure 1 for an example of k-medians), it is not always easy for humans to comprehend and visualize such clustering.', '< To address this problem, Dasgupta, Frost, Moshkovitz, and Rashtchian [2020] introduced explainable k-means and k-medians clustering. They argued that decision trees are easy to understand and interpret. Therefore, in order to make clustering more explainable, they proposed using threshold decision trees to define clusters. A threshold decision tree is a binary space partitioning tree with k leaves. Each internal node of the threshold decision tree splits the data into two groups using a threshold cut (j, θ): on one side of the cut, we have points x with x_j ≤ θ and, on the other side, points x with x_j > θ. Thus, every node of the tree corresponds to a rectangular region of the space. A decision tree with k leaves partitions data set X into k clusters, P_1, ..., P_k. 
See Figure 1 for an example. Dasgupta et al. [2020] suggested that we use the standard k-medians and k-means objectives to measure the cost of the threshold decision tree. For k-medians, the cost of a threshold decision tree T equals cost(X, T) = Σ_{i=1}^{k} Σ_{x ∈ P_i} ∥x - ĉ_i∥₁, where P_1, ..., P_k is the partitioning of X produced by T, and ĉ_1, ..., ĉ_k are the medians of clusters P_1, ..., P_k. We denote the ℓ₁-norm by ∥·∥₁. Note that each P_i is a rectangular region of the space. Thus, generally speaking, not every x is assigned to the closest of the centers ĉ_1, ..., ĉ_k, as it would be in unconstrained k-medians or k-means.', '< [Figure 1: threshold decision tree with cuts y ≤ 8.6 and x ≤ -1.9 and leaves 1, 2, 3]', '< Dasgupta, Frost, Moshkovitz, and Rashtchian [2020] defined the price of explainability as the ratio of the k-medians cost of explainable clustering to the optimal cost of unconstrained k-medians clustering. They showed that the price of explainability for k-means and k-medians (somewhat surprisingly) does not depend on the number of points in the data set X and only depends on k. Specifically, they provided a greedy algorithm that, given k reference centers c_1, c_2, ..., c_k of any unconstrained k-medians clustering as input, outputs a threshold decision tree of cost at most O(k) times the cost of the original unconstrained k-medians clustering with centers c_1, c_2, ..., c_k. We call such an algorithm O(k)-competitive.', '< To get an explainable k-medians clustering, we first obtain reference centers c_1, c_2, ..., c_k using an off-the-shelf approximation algorithm for k-medians and then run an α-competitive algorithm for explainable k-medians with centers c_1, c_2, ..., c_k given as input. This algorithm produces the desired threshold decision tree. Dasgupta et al. 
[2020] also gave an O(k²)-competitive algorithm for k-means and showed Ω(log k) lower bounds on the price of explainability for both k-medians and k-means.', '< The notion of explainable clustering immediately got a lot of attention in the field (Laber and Murtinho [2021], Makarychev and Shan [2021], Gamlath et al. [2021], Charikar and Hu [2022], Esfandiari et al. [2022]). In particular, Makarychev and Shan [2021] and Esfandiari, Mirrokni, and Narayanan [2022] provided almost optimal algorithms for explainable k-medians, and Makarychev and Shan [2021], Esfandiari, Mirrokni, and Narayanan [2022], and Gamlath, Jia, Polak, and Svensson [2021] provided almost optimal algorithms for k-means. The competitive ratios of these algorithms are Õ(log k) for k-medians and Õ(k) for k-means.', '< The algorithms for explainable k-medians by Makarychev and Shan [2021], Esfandiari, Mirrokni, and Narayanan [2022], and Gamlath, Jia, Polak, and Svensson [2021] are variants of the same simple algorithm, which we call RANDOMCOORDINATECUT. This algorithm receives a set of k reference centers c_1, ..., c_k as input and then builds a threshold decision tree with k leaves. It works as follows.', '< It recursively partitions d-dimensional space until every cell contains exactly one reference center c_i. The algorithm starts with a tree consisting of one node, the root. Initially, all k reference centers are assigned to that root. At every step, the algorithm picks a random threshold cut (j, θ) and splits centers in every cell using this cut. If this cut does not separate any centers in a cell u (i.e., all centers in u are located on one side of the cut), then the algorithm does not split u into two regions at this step. Finally, for every leaf u of the constructed tree, the unique center that belongs to the cell corresponding to u is assigned to u. We provide pseudo-code for this algorithm in Figure 2. Makarychev and Shan [2021] and Esfandiari et al. 
[2022] showed that the competitive ratio of RANDOMCOORDINATECUT is at most O(log k log log k). That is, for every data set X and set of centers', '< c_1, ..., c_k, E[cost(X, T)] ≤ O(log k log log k) • Input: a data set X ⊂ R^d and a set of centers C = {c_1, c_2, ..., c_k} ⊂ R^d. Output: a threshold tree T. Create tree T_0 containing a root node r. Assign C_r = {c_1, c_2, ..., c_k} to the root. Let t = 0. Let M = max_{i,j} |(c_i)_j|.', '< while T_t contains a leaf with at least two distinct centers do Pick a coordinate j and threshold θ ∈ (-M, M) uniformly at random. Let ω_t = (j, θ).', '< For every leaf node u in T_t, split the set C_u into two sets:', '< Left = {c ∈ C_u : c_j ≤ θ} and Right = {c ∈ C_u : c_j > θ}.', '< If both sets are not empty, then create two children of u in tree T_t. The left child corresponds to the subregion of u with x_j ≤ θ, and the right child corresponds to the subregion of u with x_j > θ. Assign sets Left and Right to the left and right child, respectively.', '< Denote the updated tree by T_{t+1}. Update t = t + 1. end while. Figure 2: RANDOMCOORDINATECUT algorithm. [...] conjectured that this algorithm is optimal and its competitive ratio is O(log k), more specifically, H_{k-1} + 1, where H_k is the k-th harmonic number. They provided some justification for their conjecture by proving this bound for a very special set of centers and data points (corresponding to the case of completely disjoint sets in our Set Elimination Game).', '< Our Results. In this work, we show that indeed the competitive ratio of RANDOMCOORDINATECUT is at most 2 ln k + 2, and, therefore, this algorithm has the optimal competitive ratio, which matches the lower bound of Dasgupta, Frost, Moshkovitz, and Rashtchian [2020]. Our analysis is not only tight but also fairly simple. To get our result, we define a game, the Set Elimination Game, which was also implicitly analyzed in previous works on this topic. We show that the cost of this game is at most 2 ln k + 2. 
Related Work. Unconstrained k-medians clustering has been extensively studied. Charikar, Guha, Tardos, and Shmoys [1999] gave the first constant factor approximation algorithm for the problem in general metric spaces. Li and Svensson [2013] provided a (1 + √3 + ε)-approximation algorithm. Byrka, Pensyl, Rybicki, Srinivasan, and Trinh [2017] improved the approximation factor to 2.675 + ε. Cohen-Addad, Esfandiari, Mirrokni, and Narayanan [2022] recently improved the approximation factor to 2.406 for Euclidean k-medians. Megiddo and Supowit [1984] showed that the k-medians problem in ℓ₁ is NP-hard. Cohen-Addad and Lee [2022] showed that it is also NP-hard to approximate k-medians in ℓ₁ within a factor of 1.06.', '< As we discussed above, Gamlath, Jia, Polak, and Svensson [2021], Esfandiari, Mirrokni, and Narayanan [2022], and Makarychev and Shan [2021] independently proposed the RANDOMCOORDINATECUT algorithm. They also gave an Õ(k)-competitive algorithm for explainable k-means and showed a lower bound of Ω(k) for the problem. Charikar and Hu [2022] provided an O(k^{1-2/d} · poly(d, log k))-competitive algorithm for explainable k-means, whose competitive ratio depends on the dimension d of the instance. For small d ≪ log k / log log k, their bound is better than O(k). They showed an almost matching Ω(k^{1-2/d} / polylog k) lower bound for explainable k-means. Esfandiari et al. [2022] gave an upper bound of O(d log² d) on the competitive ratio of RANDOMCOORDINATECUT for explainable k-medians. This bound is better than O(log k) for small d ≪ log k / log log k. Laber and Murtinho [2021] gave O(d log k)- and O(dk log k)-competitive algorithms for explainable k-medians and k-means, respectively. Frost, Moshkovitz, and Rashtchian [2020] provided some empirical evidence that bi-criteria algorithms for explainable k-means (that partition the data set into (1 + δ)k clusters) can give a much better competitive ratio than O(k). 
Then, Makarychev and Shan [2022] gave an Õ((1/δ) log² k)-competitive bi-criteria algorithm for explainable k-means. Bandyapadhyay, Fomin, Golovach, Lochet, Purohit, and Simonov [2022] provided an algorithm that computes the optimal explainable k-medians and k-means clustering in time n^{2d+O(1)} and (4nd)^{k+O(1)}, respectively. Laber, Murtinho, and Oliveira [2023] proposed to use shallow decision trees for explainable clustering.', '< Independently and concurrently with our work, Gupta, Pittu, Svensson, and Yuan [2023] proved an O(log k) bound on the price of explainability for k-medians. They showed that the competitive ratio of RANDOMCOORDINATECUT is 1 + H_{k-1}, where H_k is the k-th harmonic number. Their work answers the open question raised by Gamlath, Jia, Polak, and Svensson [2021]. They also proved a hardness of approximation result for explainable k-medians clustering and improved the competitive ratio for explainable k-means from O(k log k) to O(k log log k).', '---', '> The increasing deployment of machine learning in critical domains like healthcare, finance, and public policy necessitates algorithmic transparency and interpretability. As machine learning models influence significant decisions, understanding their underlying logic is paramount. This paper addresses the challenge of explainable clustering, a crucial step towards transparent unsupervised learning. Specifically, we focus on explainable k-medians clustering, aiming to produce data partitions that are easily understood and visualized by humans.', '24a8,21', '> Traditional clustering algorithms, such as k-means, k-medians, and k-medoids, are fundamental centroid-based methods. They partition data into Voronoi cells based on proximity to k centers. 
However, these Voronoi cells often exhibit complex boundaries (as illustrated in Figure 1), making the resulting clusters difficult for human users to comprehend and interpret.', '> ', '> To address this problem, Dasgupta, Frost, Moshkovitz, and Rashtchian [2020] introduced the concept of explainable k-means and k-medians clustering, proposing threshold decision trees as an intuitive and interpretable means to define clusters. A threshold decision tree is a binary space partitioning tree with k leaves. Each internal node performs a threshold cut (j, θ), splitting data points based on whether x_j ≤ θ or x_j > θ. This process recursively partitions the d-dimensional space into k rectangular regions, each corresponding to a cluster P_i (see Figure 1 for an illustration). The cost of such a threshold decision tree T is measured using the standard k-medians objective: cost(X, T) = Σ_{i=1}^{k} Σ_{x ∈ P_i} ∥x - ĉ_i∥₁, where ĉ_1, ..., ĉ_k are the medians of clusters P_1, ..., P_k. We denote the ℓ₁-norm by ∥·∥₁. Unlike in unconstrained k-medians, points are not necessarily assigned to their closest center, but rather to the median of the rectangular region that contains them.', '> ', "> The 'price of explainability' quantifies the trade-off between interpretability and clustering quality, defined as the ratio of the explainable k-medians cost to the optimal unconstrained k-medians cost. Dasgupta et al. [2020] demonstrated that this ratio depends only on k, not on the data set size. They introduced a greedy algorithm that is O(k)-competitive for explainable k-medians, given k reference centers. The standard approach involves first obtaining these reference centers via an off-the-shelf approximation algorithm for k-medians, followed by an α-competitive algorithm for explainable k-medians. 
They also established an Ω(log k) lower bound on the price of explainability for both k-medians and k-means.", '> ', '> The problem of explainable clustering has garnered significant research attention. Several algorithms, including those by Makarychev and Shan [2021] and Esfandiari et al. [2022], have been proposed, achieving near-optimal competitive ratios of Õ(log k) for k-medians and Õ(k) for k-means. A particularly influential and simple algorithm, independently discovered by multiple groups, is the RANDOMCOORDINATECUT algorithm.', '> ', '> The RANDOMCOORDINATECUT algorithm constructs a threshold decision tree from a given set of k reference centers c_1, ..., c_k. It operates by recursively partitioning the d-dimensional space. Starting with all centers at the root, at each step, a random coordinate j and threshold θ are chosen. This cut is applied to every current cell, splitting the centers within it. If the cut separates the centers in a cell, that cell is divided into two sub-regions; otherwise, the cell is left unchanged at this step. This process continues until each leaf node of the tree contains exactly one reference center. The pseudo-code for this algorithm is presented in Figure 2. Prior analyses by Makarychev and Shan [2021] and Esfandiari et al. [2022] established an O(log k log log k) competitive ratio for RANDOMCOORDINATECUT. It was conjectured that the algorithm is in fact optimal, achieving an O(log k) competitive ratio, more specifically, H_{k-1} + 1, where H_k is the k-th harmonic number.', '> ', "> Our Results. In this paper, we present a tight analysis of the RANDOMCOORDINATECUT algorithm for explainable k-medians. We prove that its competitive ratio is at most 2 ln k + 2. This bound is optimal up to a constant factor, matching the Ω(log k) lower bound established by Dasgupta, Frost, Moshkovitz, and Rashtchian [2020], thereby resolving the conjecture regarding the algorithm's performance. 
Our analysis is not only tight but also simple: it is based on the 'Set Elimination Game,' a concept implicitly present in prior work, which we formalize and analyze to derive our main result.", '> ', '> Concurrent Work. Independently and concurrently with our work, Gupta, Pittu, Svensson, and Yuan [2023] also proved an O(log k) bound on the price of explainability for k-medians. They showed that the competitive ratio of RANDOMCOORDINATECUT is 1 + H_{k-1}, where H_k is the k-th harmonic number. Their work similarly answers the open question raised by Gamlath, Jia, Polak, and Svensson [2021] and further improves the competitive ratio for explainable k-means from O(k log k) to O(k log log k).', '> ', '260d256', '< ']
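
The RANDOMCOORDINATECUT procedure described in the text (Figure 2) and the k-medians cost of the resulting tree can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names (random_coordinate_cut, assign, tree_cost), the dictionary-based tree representation, and the seed parameter are assumptions introduced here. The threshold is drawn with random.Random.uniform, which may return an endpoint of [-M, M] rather than a point of the open interval (-M, M); that happens with negligible probability and does not change the behavior illustrated.

```python
import random
from statistics import median


def _leaf(center_ids):
    # A node is a dict; it stays a leaf while "cut" is None.
    return {"centers": center_ids, "cut": None, "left": None, "right": None}


def random_coordinate_cut(centers, seed=None):
    """Build a threshold tree whose leaves each hold one distinct center."""
    rng = random.Random(seed)
    d = len(centers[0])
    M = max(abs(v) for c in centers for v in c)
    root = _leaf(list(range(len(centers))))
    leaves = [root]

    def mixed(u):
        # True if leaf u still holds at least two distinct centers.
        return len({tuple(centers[i]) for i in u["centers"]}) > 1

    while any(mixed(u) for u in leaves):
        j = rng.randrange(d)          # random coordinate
        theta = rng.uniform(-M, M)    # random threshold
        next_leaves = []
        for u in leaves:
            left = [i for i in u["centers"] if centers[i][j] <= theta]
            right = [i for i in u["centers"] if centers[i][j] > theta]
            if left and right:        # the cut separates centers in u
                u["cut"] = (j, theta)
                u["left"], u["right"] = _leaf(left), _leaf(right)
                next_leaves += [u["left"], u["right"]]
            else:                     # the cut is skipped for this cell
                next_leaves.append(u)
        leaves = next_leaves
    return root


def assign(root, x):
    """Follow the threshold cuts from the root to the leaf containing x."""
    u = root
    while u["cut"] is not None:
        j, theta = u["cut"]
        u = u["left"] if x[j] <= theta else u["right"]
    return u


def tree_cost(root, points):
    """Sum of l1 distances from each point to its cluster's median."""
    clusters = {}
    for x in points:
        clusters.setdefault(id(assign(root, x)), []).append(x)
    total = 0.0
    for pts in clusters.values():
        med = [median(col) for col in zip(*pts)]  # coordinate-wise median
        total += sum(abs(a - m) for p in pts for a, m in zip(p, med))
    return total


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    points = [(0.5, 0.2), (9.0, 1.0), (1.0, 9.5), (0.1, -0.3)]
    tree = random_coordinate_cut(centers, seed=0)
    print(tree_cost(tree, points))
```

Because one random cut is applied to every current leaf at each step, the sketch mirrors the shared-cut structure of the pseudo-code rather than drawing independent cuts per cell. The cost function recomputes coordinate-wise medians of the induced clusters, as in the definition of cost(X, T), instead of reusing the reference centers.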
