Title: Random Cuts are Optimal for Explainable k-Medians

Abstract: We show that the RANDOMCOORDINATECUT algorithm gives the optimal competitive ratio for explainable k-medians in $\ell_1$. The problem of explainable k-medians was introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian in 2020. Several groups of authors independently proposed a simple polynomial-time randomized algorithm for the problem and showed that this algorithm is $O(\log k \log\log k)$-competitive. We provide a tight analysis of the algorithm and prove that its competitive ratio is upper bounded by $2 \ln k + 2$. This bound matches the $\Omega(\log k)$ lower bound by Dasgupta et al. (2020). Note that the running time of this algorithm is $\tilde{O}(kd)$. Gamlath, Jia, Polak, and Svensson [2021] provided a slightly worse bound of $O(\log^2 k)$ on the competitive ratio of this algorithm. They also

Section: Introduction
Machine learning is being increasingly used to make decisions for critical applications, such as healthcare, finance, and public policy. Considering the profound impact of algorithmic decisions on individuals and society, it is essential to understand the underlying logic behind these decisions. In this paper, we explore an explainable k-medians clustering algorithm (called RANDOMCOORDINATECUT). The algorithm's aim is to cluster data sets and present results in a manner easily understood and visualized by humans.
Clustering is a fundamental task in unsupervised learning. Among many clustering methods, k-means, k-medians, and k-medoids are particularly popular. These are centroid-based methods that choose k centers and assign each data point to the center nearest to it. As a result, each cluster is a Voronoi cell in the Voronoi partition of the space. Since these cells may have a complicated boundary (see Figure 1 for an example of k-medians), it is not always easy for humans to comprehend and visualize such clustering.
To address this problem, Dasgupta, Frost, Moshkovitz, and Rashtchian [2020] introduced explainable k-means and k-medians clustering. They argued that decision trees are easy to understand and interpret, so, to make clustering more explainable, clusters should be defined by threshold decision trees. A threshold decision tree is a binary space partitioning tree with k leaves. Each internal node of the threshold decision tree splits the data into two groups using a threshold cut $(j, \theta)$: on one side of the cut, we have points x with $x_j \le \theta$ and, on the other side, points x with $x_j > \theta$. Thus, every node of the tree corresponds to a rectangular region of the space. A decision tree with k leaves partitions data set X into k clusters, $P_1, \dots, P_k$. See Figure 1 for an example. Dasgupta et al. [2020] suggested that we use the standard k-medians and k-means objectives to measure the cost of the threshold decision tree. For k-medians, the cost of a threshold decision tree T equals
$$\mathrm{cost}(X, T) = \sum_{i=1}^{k} \sum_{x \in P_i} \|x - \hat{c}_i\|_1,$$
where $P_1, \dots, P_k$ is the partitioning of X produced by T, and $\hat{c}_1, \dots, \hat{c}_k$ are the medians of clusters $P_1, \dots, P_k$. We denote the $\ell_1$-norm by $\|\cdot\|_1$. Note that each $P_i$ is a rectangular region of the space. Thus, in general, a point x is not necessarily assigned to the closest of the centers $\hat{c}_1, \dots, \hat{c}_k$, unlike in unconstrained k-medians or k-means.
[Figure 1: a threshold decision tree with threshold cuts y ≤ 8.6 and x ≤ -1.9 and three leaves labeled 1, 2, 3.]
Dasgupta, Frost, Moshkovitz, and Rashtchian [2020] defined the price of explainability as the ratio of the k-medians cost of explainable clustering to the optimal cost of unconstrained k-medians clustering. They showed that the price of explainability for k-means and k-medians (somewhat surprisingly) does not depend on the number of points in the data set X and only depends on k. Specifically, they provided a greedy algorithm that, given k reference centers $c^1, c^2, \dots, c^k$ of any unconstrained k-medians clustering as input, outputs a threshold decision tree of cost at most O(k) times the cost of the original unconstrained k-medians clustering with centers $c^1, c^2, \dots, c^k$. We call such an algorithm O(k)-competitive.
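The cost definition above can be illustrated with a small sketch. It assumes each cluster is given as a list of coordinate tuples and uses the fact that the coordinate-wise median minimizes the $\ell_1$ cost of a cluster; the names `cluster_cost` and `tree_cost` and the toy partition are hypothetical and not taken from the paper.

```python
from statistics import median

def cluster_cost(points):
    """l1 cost of one cluster: total distance to the coordinate-wise median."""
    if not points:
        return 0.0
    dim = len(points[0])
    center = [median(p[j] for p in points) for j in range(dim)]
    return sum(abs(p[j] - center[j]) for p in points for j in range(dim))

def tree_cost(clusters):
    """cost(X, T) for the partition P_1, ..., P_k induced by a threshold tree."""
    return sum(cluster_cost(P) for P in clusters)

# Hypothetical toy partition induced by the single cut (j = 0, theta = 2.0).
left = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]
right = [(5.0, 5.0), (7.0, 5.0)]
total = tree_cost([left, right])
print(total)
```

Dividing such a cost by the cost of an unconstrained k-medians solution on the same data gives an empirical analogue of the ratio that the price of explainability bounds.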
To get an explainable k-medians clustering, we first obtain reference centers $c^1, c^2, \dots, c^k$ using an off-the-shelf approximation algorithm for k-medians and then run an α-competitive algorithm for explainable k-medians with centers $c^1, c^2, \dots, c^k$ given as input. This algorithm produces the desired threshold decision tree. Dasgupta et al. [2020] also gave an $O(k^2)$-competitive algorithm for k-means and showed $\Omega(\log k)$ lower bounds on the price of explainability for both k-medians and k-means.
The notion of explainable clustering immediately got a lot of attention in the field (Laber and Murtinho [2021], Makarychev and Shan [2021], Gamlath et al. [2021], Charikar and Hu [2022], Esfandiari et al. [2022]). Particularly, Makarychev and Shan [2021], Esfandiari, Mirrokni, and Narayanan [2022] provided almost optimal algorithms for explainable k-medians, and Makarychev and Shan [2021], Esfandiari, Mirrokni, and Narayanan [2022], Gamlath, Jia, Polak, and Svensson [2021] provided almost optimal algorithms for k-means. The competitive ratios of these algorithms are Õ(log k) for k-medians and Õ(k) for k-means.
The algorithms for explainable k-medians by Makarychev and Shan [2021], Esfandiari, Mirrokni, and Narayanan [2022], Gamlath, Jia, Polak, and Svensson [2021] are variants of the same simple algorithm, which we call RANDOMCOORDINATECUT. This algorithm receives a set of k reference centers c 1 , . . . , c k as input and then builds a threshold decision tree with k leaves. It works as follows.
It recursively partitions d-dimensional space until every cell contains exactly one reference center $c^i$. The algorithm starts with a tree consisting of one node, the root. Initially, all k reference centers are assigned to that root. At every step, the algorithm picks a random threshold cut $(j, \theta)$ and splits centers in every cell using this cut. If this cut does not separate any centers in a cell u (i.e., all centers in u are located on one side of the cut), then the algorithm does not split u into two regions at this step. Finally, for every leaf u of the constructed tree, the unique center that belongs to the cell corresponding to u is assigned to u. We provide pseudo-code for this algorithm in Figure 2. Makarychev and Shan [2021], Esfandiari et al. [2022] showed that the competitive ratio of RANDOMCOORDINATECUT is at most $O(\log k \log\log k)$. That is, for every data set X and set of centers $c^1, \dots, c^k$,
$$E[\mathrm{cost}(X, T)] \le O(\log k \log\log k) \cdot \sum_{x \in X} \min_{c \in \{c^1, \dots, c^k\}} \|x - c\|_1.$$

Input: a data set $X \subset R^d$ and set of centers $C = \{c^1, c^2, \dots, c^k\} \subset R^d$
Output: a threshold tree T
Create tree $T_0$ containing a root node r. Assign $C_r = \{c^1, c^2, \dots, c^k\}$ to the root. Let t = 0. Let $M = \max_{i,j} |c^i_j|$.
while $T_t$ contains a leaf with at least two distinct centers do
    Pick a coordinate j and threshold $\theta \in (-M, M)$ uniformly at random. Let $\omega_t = (j, \theta)$.
    For every leaf node u in $T_t$, split the set $C_u$ into two sets:
        Left = $\{c \in C_u : c_j \le \theta\}$ and Right = $\{c \in C_u : c_j > \theta\}$.
    If both sets are not empty, then create two children of u in tree $T_t$. The left child corresponds to the subregion of u with $x_j \le \theta$, and the right child corresponds to the subregion of u with $x_j > \theta$. Assign sets Left and Right to the left and right child, respectively.
    Denote the updated tree by $T_{t+1}$. Update t = t + 1.
end while
Figure 2: RANDOMCOORDINATECUT algorithm

Gamlath, Jia, Polak, and Svensson [2021] conjectured that this algorithm is optimal and its competitive ratio is $O(\log k)$, more specifically, $H_{k-1} + 1$, where $H_k$ is the k-th harmonic number. They provided some justification for their conjecture by proving this bound for a very special set of centers and data points (corresponding to the case of completely disjoint sets in our Set Elimination Game).
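The pseudo-code of Figure 2 can be sketched in Python as follows. This is a minimal illustration, assuming the centers are distinct tuples of floats; the representation of the tree as a list of leaves, each with its path of cuts, and the helper `assign` are hypothetical choices made for brevity, not the authors' implementation.

```python
import random

def random_coordinate_cut(centers, rng=None):
    """Return the leaves of a threshold tree as (path, center) pairs.

    A path is the list of (j, theta, side) constraints from the root to the
    leaf, where side is "<=" or ">". Centers are assumed to be distinct.
    """
    rng = rng or random.Random()
    d = len(centers[0])
    M = max(abs(c[j]) for c in centers for j in range(d))
    leaves = [([], list(centers))]          # the root holds all centers
    while any(len(cs) > 1 for _, cs in leaves):
        j = rng.randrange(d)                # random coordinate
        theta = rng.uniform(-M, M)          # random threshold
        new_leaves = []
        for path, cs in leaves:             # the same cut is tried in every leaf
            left = [c for c in cs if c[j] <= theta]
            right = [c for c in cs if c[j] > theta]
            if left and right:              # split only if the cut separates centers
                new_leaves.append((path + [(j, theta, "<=")], left))
                new_leaves.append((path + [(j, theta, ">")], right))
            else:
                new_leaves.append((path, cs))
        leaves = new_leaves
    return [(path, cs[0]) for path, cs in leaves]

def assign(x, leaves):
    """Return the center of the leaf whose cell contains point x."""
    for path, center in leaves:
        if all((x[j] <= t) == (side == "<=") for j, t, side in path):
            return center
    raise ValueError("x is not covered by any leaf")

centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (-3.0, -3.0)]
leaves = random_coordinate_cut(centers, random.Random(0))
print(assign((3.5, 0.2), leaves))
```

Because the leaves partition the space, `assign` finds exactly one matching cell for every point; each reference center is always assigned to itself.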
Our Results. In this work, we show that indeed the competitive ratio of RANDOMCOORDINATECUT is at most $2 \ln k + 2$, and, therefore, this algorithm has the optimal competitive ratio, which matches the lower bound of Dasgupta, Frost, Moshkovitz, and Rashtchian [2020]. Our analysis is not only tight but also fairly simple. To get our result we define a game, the Set Elimination Game, which was also implicitly analyzed in previous works on this topic. We show that the cost of this game is at most $2 \ln k + 2$.
Related Work. The unconstrained k-medians clustering has been extensively studied. Charikar, Guha, Tardos, and Shmoys [1999] gave the first constant factor approximation algorithm for the problem in general metric spaces. Li and Svensson [2013] provided a $1 + \sqrt{3} + \varepsilon$ approximation algorithm. Byrka, Pensyl, Rybicki, Srinivasan, and Trinh [2017] improved the approximation factor to $2.675 + \varepsilon$. Cohen-Addad, Esfandiari, Mirrokni, and Narayanan [2022] recently improved the approximation factor to 2.406 for Euclidean k-medians. Megiddo and Supowit [1984] showed that the k-medians problem in $\ell_1$ is NP-hard. Cohen-Addad and Lee [2022] showed that it is also NP-hard to approximate k-medians in $\ell_1$ within a factor of 1.06.
As we discussed above, Gamlath, Jia, Polak, and Svensson [2021], Esfandiari, Mirrokni, and Narayanan [2022], and Makarychev and Shan [2021] independently proposed the RANDOMCOORDINATECUT algorithm. They also gave an $\tilde{O}(k)$-competitive algorithm for explainable k-means and showed a lower bound of $\Omega(k)$ for the problem. Charikar and Hu [2022] provided an $O(k^{1-2/d} \cdot \mathrm{poly}(d, \log k))$-competitive algorithm for explainable k-means, whose competitive ratio depends on the dimension d of the instance. For small $d \ll \log k / \log\log k$, their bound is better than O(k). They showed an almost matching $\Omega(k^{1-2/d} / \mathrm{polylog}\, k)$ lower bound for explainable k-means. Esfandiari et al. [2022] gave an upper bound of $O(d \log^2 d)$ on the competitive ratio of RANDOMCOORDINATECUT for explainable k-medians. This bound is better than $O(\log k)$ for small $d \ll \log k / \log\log k$. Laber and Murtinho [2021] gave $O(d \log k)$- and $O(dk \log k)$-competitive algorithms for explainable k-medians and k-means, respectively. Frost, Moshkovitz, and Rashtchian [2020] provided some empirical evidence that bi-criteria algorithms for explainable k-means (that partition the data set into $(1 + \delta)k$ clusters) can give a much better competitive ratio than O(k). Then, Makarychev and Shan [2022] gave an $\tilde{O}(\frac{1}{\delta} \log^2 k)$-competitive bi-criteria algorithm for explainable k-means. Bandyapadhyay, Fomin, Golovach, Lochet, Purohit, and Simonov [2022] provided an algorithm that computes the optimal explainable k-medians and k-means clustering in time $n^{2d+O(1)}$ and $(4nd)^{k+O(1)}$, respectively. Laber, Murtinho, and Oliveira [2023] proposed to use shallow decision trees for explainable clustering.
Independently and concurrently with our work, Gupta, Pittu, Svensson, and Yuan [2023] proved an $O(\log k)$ bound on the price of explainability for k-medians. They showed that the competitive ratio of RANDOMCOORDINATECUT is $1 + H_{k-1}$, where $H_k$ is the k-th harmonic number. Their work answers the open question raised by Gamlath, Jia, Polak, and Svensson [2021]. They also proved a hardness of approximation result for explainable k-medians clustering and improved the competitive ratio for explainable k-means from $O(k \log k)$ to $O(k \log\log k)$.

Section: Set Elimination Game
In this section, we define the set elimination game. Consider a discrete finite measure space $(\Omega, \mu)$. In this space, each element $\omega \in \Omega$ has a measure of $\mu(\omega)$, and the measure of every set $S \subseteq \Omega$ equals $\mu(S) = \sum_{\omega \in S} \mu(\omega)$. Let $S_1, S_2, \dots, S_k \subset \Omega$ be k distinct sets which may overlap with each other. The set elimination game proceeds in a series of rounds. Initially, all sets $S_1, \dots, S_k$ enter the competition. Formally, they belong to the set of remaining sets $R_0 = \{S_1, \dots, S_k\}$. At every round n, the host picks a random $\omega_n \in \Omega$ with probability $\Pr(\omega_n = \omega) = \mu(\omega)/\mu(\Omega)$. Then, all sets $S_i$ that contain $\omega_n$ are eliminated from the game unless all remaining sets contain $\omega_n$, in which case, no set gets eliminated. That is, for $n \ge 1$,
$$R_n = \begin{cases} R_{n-1} \setminus \{S_i \in R_{n-1} : \omega_n \in S_i\}, & \text{if } \omega_n \notin S_i \text{ for some } S_i \in R_{n-1}; \\ R_{n-1}, & \text{otherwise.} \end{cases} \tag{1}$$
The last remaining set is declared the winner. We denote that winner by winner. We say that the cost of the game is the measure of the winning set, µ(winner).
We remark that $R_n$ cannot become empty (in which case, the winner would not be defined) because of the "otherwise" clause in the definition (1). We shall always assume that all sets $S_1, \dots, S_k$ are not only distinct and non-empty but also satisfy (a) for every i, $\mu(S_i) > 0$, and (b) for all $i \ne j$, $\mu(S_i \triangle S_j) > 0$ (here, $S_i \triangle S_j$ denotes the symmetric difference of sets $S_i$ and $S_j$). Then, in every game, there is a unique winner with probability 1.
We similarly define the set elimination game for arbitrary finite measure spaces: For an arbitrary finite measure space (Ω, µ), element ω n is chosen with probability function Pr(ω n ∈ S) = µ(S)/µ(Ω).
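To make the rules concrete, here is a minimal simulation sketch of the discrete game. It assumes a finite ground set with positive weights and sets that satisfy conditions (a) and (b) above, so that the loop terminates with probability 1; the function name `play_game` and the three-element instance are hypothetical.

```python
import math
import random

def play_game(sets, mu, rng):
    """Play one set elimination game and return the index of the winner.

    sets: list of frozensets over a finite ground set.
    mu:   dict mapping each element of the ground set to its positive measure.
    """
    elements = list(mu)
    weights = [mu[w] for w in elements]
    remaining = set(range(len(sets)))
    while len(remaining) > 1:
        # Draw omega with probability mu(omega) / mu(Omega).
        omega = rng.choices(elements, weights=weights)[0]
        hit = {i for i in remaining if omega in sets[i]}
        if hit != remaining:        # some remaining set avoids omega
            remaining -= hit        # eliminate every remaining set containing omega
    return next(iter(remaining))

# Hypothetical instance: Omega = {a, b, c} with measures 1, 2, 3.
sets = [frozenset("a"), frozenset("bc"), frozenset("ac")]
mu = {"a": 1.0, "b": 2.0, "c": 3.0}
rng = random.Random(0)
costs = []
for _ in range(2000):
    i = play_game(sets, mu, rng)
    costs.append(sum(mu[w] for w in sets[i]))
avg = sum(costs) / len(costs)
smallest = min(sum(mu[w] for w in S) for S in sets)
bound = (2 * math.log(len(sets)) + 2) * smallest
print(avg, bound)
```

The empirical average cost can be compared with the quantity $(2 \ln k + 2) \cdot \min_i \mu(S_i)$ computed in `bound`.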
Our main result is the following theorem, which, as we discuss later in Section 2.1, implies that the competitive ratio of the explainable clustering algorithm is at most $2 \ln k + 2$.
Theorem 2.1. Consider a set elimination game with the finite measure space $(\Omega, \mu)$ and k distinct sets $S_1, S_2, \dots, S_k$ (as above). The expected cost of the game is at most
$$E[\mu(\mathrm{winner})] \le (2 \ln k + 2) \cdot \min_{i \in [k]} \mu(S_i).$$
To simplify the exposition, we will prove this theorem for discrete finite measure spaces. If Ω is not a discrete measure space, we first replace it with a quotient space: We say that $\omega' \in \Omega$ and $\omega'' \in \Omega$ are equivalent ($\omega' \sim \omega''$) if they are contained in exactly the same sets among $S_1, \dots, S_k$. This equivalence relation partitions Ω into at most $2^k$ different equivalence classes. We replace Ω with the quotient space $\Omega/{\sim}$ whose elements are equivalence classes. In other words, we merge all equivalent ω's. The measure of a new element ω equals the measure of the corresponding equivalence class.
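For a finite ground set, the merging step can be sketched as follows. Each element is mapped to its membership signature (the set of indices i with $\omega \in S_i$), and elements with the same signature are merged; the function name `quotient` and the sample data are hypothetical.

```python
from collections import defaultdict

def quotient(sets, mu):
    """Merge elements of Omega that lie in exactly the same sets S_1, ..., S_k."""
    class_measure = defaultdict(float)
    for omega, weight in mu.items():
        signature = frozenset(i for i, S in enumerate(sets) if omega in S)
        class_measure[signature] += weight   # measure of the equivalence class
    # A new element is a signature; the new set i contains all signatures with i.
    new_sets = [frozenset(sig for sig in class_measure if i in sig)
                for i in range(len(sets))]
    return new_sets, dict(class_measure)

sets = [frozenset("ab"), frozenset("bcd")]
mu = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 1.5, "e": 3.0}
new_sets, new_mu = quotient(sets, mu)
print(len(new_mu), [sum(new_mu[s] for s in S) for S in new_sets])
```

The number of classes is at most $2^k$, and the measure of every set $S_i$ is unchanged by the merge.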
Organization. In Section 2.1, we discuss the connection between explainable k-medians and set elimination games. We define a set elimination game in a set system I ⊂ {S 1 , . . . , S k } in Section 2.2. Then, we define the hitting and elimination time in Section 2.3. In Section 3, we first illustrate our proof strategy by showing Theorem 2.1 for the case when the smallest set S 1 does not overlap with S 2 , . . . , S k . An important ingredient of our proof is the notion of surprise sets, which we discuss in Section 3.1. Finally, we complete the proof of Theorem 2.1 in Section 3.2.

Section: Explainable k-Medians via Set Elimination Game
In this section, we show how to use Theorem 2.1 to obtain a bound of 2 ln k + 2 on the competitive ratio of the RANDOMCOORDINATECUT algorithm.
Theorem 2.2. The competitive ratio of the RANDOMCOORDINATECUT algorithm for Explainable k-Medians is at most 2 ln k + 2. That is, for every set of centers C = {c 1 , . . . , c k } and data set X, the algorithm finds a random decision tree T such that
$$E[\mathrm{cost}(X, T)] \le (2 \ln k + 2) \cdot \sum_{x \in X} \min_{c \in \{c^1, \dots, c^k\}} \|x - c\|_1.$$
The pseudo-code for the RANDOMCOORDINATECUT algorithm is provided in Figure 2.
Theorem 2.2 shows that given any k centers C = {c 1 , . . . , c k }, RANDOMCOORDINATECUT finds a decision tree T with cost at most 2 ln k + 2 times the cost of unconstrained k-medians with centers C = {c 1 , . . . , c k }. By using k centers given by any constant approximation algorithm for k-medians, RANDOMCOORDINATECUT finds a decision tree with cost at most O(log k) times the optimal unconstrained k-medians cost. This implies an O(log k) upper bound on the price of explainability.
Proof of Theorem 2.2. Consider an arbitrary data set X ⊂ R d and set of k centers C ⊂ R d . We assume that all points in X and all centers in C are in the cube [-M, M ] d . The threshold decision tree obtained by the RANDOMCOORDINATECUT algorithm partitions the space into k cells. Each cell contains a single reference center c i . The center c i is not necessarily optimal for cluster P i (cluster P i is the intersection of the data set X and i-th cell). However, we will use it as a proxy for the optimal center. In other words, we will upper bound the cost of the threshold decision tree as follows:
$$\mathrm{cost}(X, T) \equiv \min_{\hat{c}_1, \dots, \hat{c}_k} \sum_{i=1}^{k} \sum_{x \in P_i} \|x - \hat{c}_i\|_1 \le \sum_{i=1}^{k} \sum_{x \in P_i} \|x - c^i\|_1.$$
Let Ω be the set of all coordinate cuts:
Ω = {(j, θ) : j ∈ [d], θ ∈ [-M, M ]}.
We define a measure µ on Ω as follows. For every subset S ⊂ Ω, we set
$$\mu(S) = \sum_{j=1}^{d} \mu_L(\{\theta : (j, \theta) \in S\}),$$
where µ L is the Lebesgue measure on R. Thus, we have µ(Ω) = 2dM , which implies (Ω, µ) is a finite measure space.
Consider any data point x ∈ X. Define k sets S 1 , S 2 , . . . , S k for the set elimination game. For every i ∈ {1, . . . , k}, let S i be the set of all threshold cuts that separate x and center c i , i.e.,
$$S_i = \{(j, \theta) \in \Omega : \mathrm{sign}(x_j - \theta) \ne \mathrm{sign}(c^i_j - \theta)\}.$$
Note that the ℓ 1 distance from x to center c i equals the measure of S i : ∥x -c i ∥ 1 = µ(S i ). We now examine the set elimination game with sets S 1 , . . . , S k , measure space (Ω, µ), and random sequence of draws ω 1 , ω 2 , . . . (each ω n ∈ Ω is the threshold cut chosen by the RANDOMCOORDINATECUT algorithm at step n). We claim that S i belongs to R n if and only if center c i lies in the same cell as point x after step n of the algorithm. This is the case for n = 0, since R 0 contains all sets S 1 , . . . , S k and the root of the threshold tree contains all centers c 1 , . . . , c k . Then, whenever we pick cut ω n , all centers separated from x by ω n are removed from the cell of x. The only exception from this rule occurs when all centers in that cell lie on the same side of the cut ω n . That is exactly the same rule as we have for the set elimination game (note that center c i is separated from x by ω n if and only if ω n ∈ S i ). Therefore, the same sets S i remain in the game as center c i in the cell of x (namely, sets S i and centers c i have the same indices).
The RANDOMCOORDINATECUT algorithm stops when all leaves of the decision tree contain exactly one center. At this step, the set elimination game contains one set, S i . This set corresponds to the center c i assigned to point x. The cost of the game µ(S i ) equals the distance from x to c i . By Theorem 2.1, we have
$$E[\mathrm{cost}(x, T)] = E[\mu(\mathrm{winner})] \le (2 \ln k + 2) \cdot \min_i \mu(S_i) = (2 \ln k + 2) \cdot \min_i \|x - c^i\|_1.$$
We sum this bound over all data points x in X and get the desired result.
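The identity $\|x - c^i\|_1 = \mu(S_i)$ used in the proof can be checked directly: in coordinate j, the thresholds that separate x from a center c form the interval between $x_j$ and $c_j$. The sketch below assumes points are tuples of floats and uses the algorithm's rule ($\le \theta$ versus $> \theta$) to test separation, which differs from the sign-based definition only on a set of measure zero; all function names are hypothetical.

```python
def separating_cuts(x, c):
    """Per coordinate j, the interval of thresholds theta that separate x and c."""
    return [(min(xj, cj), max(xj, cj)) for xj, cj in zip(x, c)]

def measure(intervals):
    """Total Lebesgue measure of the per-coordinate intervals."""
    return sum(hi - lo for lo, hi in intervals)

def separates(cut, x, c):
    """True if the threshold cut (j, theta) puts x and c on different sides."""
    j, theta = cut
    return (x[j] <= theta) != (c[j] <= theta)

x = (1.0, -2.0)
c = (4.0, 0.5)
l1_distance = sum(abs(xj - cj) for xj, cj in zip(x, c))
print(measure(separating_cuts(x, c)), l1_distance)
```

For example, the cut (0, 2.0) separates these two points, while the cut (1, 1.0) does not.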

Section: Local Competitions
We now revisit the definition of the set elimination game and define competitions in subsets of {S 1 , . . . , S k }. For the rest of the proof, we assume (Ω, µ) is a discrete finite measure space. We remind the reader that every set elimination game is determined by an infinite sequence of i.i.d. random variables ω 1 , ω 2 , . . . . In each round n, we sample an element ω n from Ω with probability Pr(ω n = ω) = µ(ω)/µ(Ω). Definition 2.3. Consider a finite measure space (Ω, µ). Let I be a set of subsets of Ω. We say that I is a valid set system if (a) for every S ∈ I, µ(S) > 0, and (b) for every S ′ , S ′′ ∈ I, µ(S ′ △S ′′ ) > 0.
The reader may assume that µ(ω) > 0 for all ω in Ω. Then, the definition above says that in a valid set system I, all sets are non-empty and distinct. Definition 2.4. Consider a finite measure space (Ω, µ). Let ω 1 , ω 2 , . . . be i.i.d. random variables as described above and I be a valid set system. We define a set elimination game in I. Initially, R 0 (I) = I. Then, for every n ≥ 1,
$$R_n(I) = \begin{cases} R_{n-1}(I) \setminus \{S \in R_{n-1}(I) : \omega_n \in S\}, & \text{if } \omega_n \notin S' \text{ for some } S' \in R_{n-1}(I); \\ R_{n-1}(I), & \text{otherwise.} \end{cases} \tag{2}$$
The winner of the game in I, denoted by winner(I), is the only element remaining, or, formally, the unique element in ∩ n≥0 R n (I). If ∩ n≥0 R n (I) contains more than one element, then the winner is not defined. The cost of the game is the measure of the winner, µ(winner(I)).
We remark that ∩ n≥0 R n (I) contains exactly one element with probability 1. Thus, the winner and cost of the game are defined with probability 1.
Consider sets $S_1, \dots, S_k$ from Theorem 2.1. Denote $K = \{S_1, \dots, S_k\}$. The definition of the competition among sets $S_1, \dots, S_k$ (given in the beginning of Section 2) is exactly the same as the definition of competition in K. Our goal is to show that $E[\mu(\mathrm{winner}(K))] \le 2(\ln k + 1) \cdot \min_{S_i \in K} \mu(S_i)$. In the proof of Theorem 2.1, we will consider competitions in different set systems $I \subseteq K$. We show the following key lemma. We defer the proof of Lemma 2.5 to Appendix A.
Lemma 2.5. Consider a partitioning of the set system $K = \{S_1, \dots, S_k\}$ into m sets $I_1, \dots, I_m$. Then, $\mathrm{winner}(K) \in \{\mathrm{winner}(I_1), \dots, \mathrm{winner}(I_m)\}$.
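Lemma 2.5 can be sanity-checked empirically by running the game in K and in each part of a partition on the same sequence of draws. The sketch below is an illustration, not a proof; it assumes sets are given as frozensets, truncates the infinite draw sequence to a finite prefix, and treats runs where some winner is still undecided as vacuous. The names `winner` and `lemma_holds` and the instance are hypothetical.

```python
import random

def winner(system, draws):
    """Winner of the game restricted to `system` for a fixed sequence of draws."""
    remaining = set(system)
    for omega in draws:
        if len(remaining) == 1:
            break
        hit = {S for S in remaining if omega in S}
        if hit != remaining:        # some remaining set avoids omega
            remaining -= hit
    return next(iter(remaining)) if len(remaining) == 1 else None

def lemma_holds(K, parts, draws):
    """Check winner(K) against the winners of the parts on the same draws."""
    w = winner(K, draws)
    part_winners = [winner(I, draws) for I in parts]
    if w is None or None in part_winners:
        return True                 # undecided within this finite prefix
    return w in part_winners

K = [frozenset(s) for s in ({0, 1}, {1, 2, 3}, {3, 4}, {0, 5}, {2, 4, 5})]
parts = [K[:2], K[2:]]
rng = random.Random(1)
draws = [rng.randrange(6) for _ in range(200)]
print(lemma_holds(K, parts, draws))
```

Sharing the draws across the games matters here: the lemma is a statement about the games driven by the same sequence $\omega_1, \omega_2, \dots$, not about independent games.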

Section: Set Elimination with Exponential Clock
Consider a set elimination game on sets S 1 , . . . , S k . It is determined by the sequence of random i.i.d. draws ω 1 , ω 2 , . . . . Random variable ω n is chosen in round n. We assign every round a random time τ n . Let the time between two consecutive rounds be an exponential random variable with parameter µ(Ω). Specifically, let ∆τ 1 , ∆τ 2 , . . . be a sequence of i.i.d. exponential random variables with parameter µ(Ω) and each
$$\tau_n = \tau_{n-1} + \Delta\tau_n = \Delta\tau_1 + \dots + \Delta\tau_n.$$
Note that all ∆τ n are positive and τ 1 , τ 2 , . . . is an increasing sequence with probability 1. The number of draws that occur by time t (i.e., N t (Ω) = |{n : τ n ≤ t}|) is a Poisson process with parameter µ(Ω). We now can think of the set elimination game as follows: The host of the game observes a Poisson process with parameter µ(Ω). Whenever the process jumps (at time τ n ), the host picks an element ω n in Ω with probability Pr(ω n = ω) = µ(ω)/µ(Ω) and eliminates some sets according to the rules of the game discussed above. Note that by assigning every round some time τ n , we do not change the game, the winner, and the cost of the game (because the sequence of random draws ω 1 , ω 2 , . . . remains the same as before). This interpretation of the game allows us to introduce a hitting time h(S) of every subset S ⊂ Ω with the following properties: (a) each h(S) is an exponential random variable with rate µ(S); (b) hitting times of disjoint sets are mutually independent random variables. Definition 2.6. For every subset X ⊂ Ω, the hitting time h(X) is the time τ n when the first ω n is drawn from X: h(X) = min{τ n : ω n ∈ X}. When the set contains one element ω, we will write h(ω) instead of h({ω}).
We also define the elimination time of each set $S_i$.
Definition 2.7. Consider any set elimination game with the measure space $(\Omega, \mu)$ and k sets $S_1, S_2, \dots, S_k$ in Ω. The elimination time $e(S_i)$ of set $S_i$ is the time when set $S_i$ is eliminated from the game, i.e., $e(S_i) = \min\{\tau_n : S_i \notin R_n(K)\}$. If $S_i$ is the winner, then we let $e(S_i) = \infty$ (because the winner is never eliminated).
Let us examine bound (3). Let Surprise be the set of all surprise sets. Note that Surprise is a random set. Then,
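$$\sum_{i=2}^{k} \Pr\{S_i = \mathrm{winner}(K)\} \, \mu(S_i) \le \sum_{i=2}^{k} \Pr\{S_i = \mathrm{winner}(K),\ S_i \notin \mathrm{Surprise}\} \cdot \mu(S_i) + \sum_{i=2}^{k} \Pr\{S_i \in \mathrm{Surprise}\} \cdot \mu(S_i). \tag{5}$$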
We show in the next section (Lemma 3.3) that the second sum is upper bounded by µ(S 1 ). We now bound the first sum. For every winner S i which is not a surprise set, we have e(S i ) ≥ h(S 1 ) (because S i is the winner) and h(S 1 ) ≤ L/µ(S i ) (because S i is not a surprise set). We also have S i = winner(I -), thus
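$$\Pr\{S_i = \mathrm{winner}(K),\ S_i \notin \mathrm{Surprise}\} \le \Pr\{h(S_1) \le L/\mu(S_i) \text{ and } S_i = \mathrm{winner}(I^-)\}.$$
By Lemma 2.9, all hitting times $h(S_i) = \min_{\omega \in S_i} h(\omega)$ for $i \ge 2$ are independent of $h(S_1)$. Thus, $\mathrm{winner}(I^-)$ is also independent of $h(S_1)$ ($\mathrm{winner}(I^-)$ depends only on the hitting times for sets $S_i \in I^-$). Therefore, using $1 - e^{-x} \le x$ with $x = L\mu(S_1)/\mu(S_i)$,
$$\Pr\{S_i = \mathrm{winner}(K),\ S_i \notin \mathrm{Surprise}\} \le \Pr\{h(S_1) \le L/\mu(S_i)\} \cdot \Pr\{S_i = \mathrm{winner}(I^-)\} = \left(1 - e^{-L\mu(S_1)/\mu(S_i)}\right) \cdot \Pr\{S_i = \mathrm{winner}(I^-)\} \le \Pr\{S_i = \mathrm{winner}(I^-)\} \cdot L \cdot \mu(S_1)/\mu(S_i).$$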
We combine all bounds on terms of ( 5) and get the following bound on the expected cost of the game:
$$\mu(S_1) + \sum_{i=2}^{k} \Pr\{S_i = \mathrm{winner}(I^-)\} \cdot L \cdot \mu(S_1) + \mu(S_1) = (L + 2) \cdot \mu(S_1) = (\ln k + 2) \cdot \mu(S_1).$$
This concludes the proof of the theorem for the case when S 1 does not overlap with S 2 , . . . , S k . We now analyze surprise sets.

Section: Surprise Sets
In this section, we prove a bound on the probability that a set S i is a surprise set. We no longer assume that S 1 does not intersect with other sets S i . We first show a lemma about exponential random variables. Lemma 3.2. Let X and Y be two independent exponential random variables with positive parameters λ X and λ Y , respectively. Then, for every T ≥ 0, we have
$$\Pr\{Y \ge X \ge T\} = \frac{\lambda_X}{\lambda_X + \lambda_Y} \cdot e^{-(\lambda_X + \lambda_Y) T}. \tag{6}$$
Proof. The desired probability can be easily found by computing $\int_T^{\infty} (F_X(t) - F_X(T)) f_Y(t) \, dt$, where $F_X(t) = 1 - e^{-\lambda_X t}$ is the cumulative distribution function of X, and $f_Y(t) = \lambda_Y \cdot e^{-\lambda_Y t}$ is the probability density function of Y. Here, we give an alternative proof. Write
$$\Pr\{Y \ge X \ge T\} = \Pr\{Y \ge X \text{ and } \min(X, Y) \ge T\} = \Pr\{X \le Y \mid \min(X, Y) \ge T\} \cdot \Pr\{\min(X, Y) \ge T\}.$$
We have $\Pr\{\min(X, Y) \ge T\} = e^{-(\lambda_X + \lambda_Y) T}$, because the minimum of two independent exponential random variables with parameters $\lambda_X$ and $\lambda_Y$ is an exponential random variable with parameter $\lambda_X + \lambda_Y$. Then, $\Pr\{X \le Y \mid \min(X, Y) \ge T\} = \Pr\{X \le Y\}$ because the exponential distribution is memoryless; and $\Pr\{X \le Y\} = \lambda_X / (\lambda_X + \lambda_Y)$.
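The first route mentioned in the proof, evaluating the integral directly, can be checked numerically. The sketch below compares a midpoint-rule approximation of $\int_T^{\infty} (F_X(t) - F_X(T)) f_Y(t) \, dt$ with the closed form (6); the truncation point `upper`, the step count, and the parameter values are arbitrary assumptions chosen for illustration.

```python
import math

def closed_form(lam_x, lam_y, T):
    """Right-hand side of (6)."""
    return lam_x / (lam_x + lam_y) * math.exp(-(lam_x + lam_y) * T)

def by_integration(lam_x, lam_y, T, upper=40.0, steps=100_000):
    """Midpoint rule for the integral of (F_X(t) - F_X(T)) f_Y(t) over [T, upper]."""
    def F(t):                      # CDF of X
        return 1.0 - math.exp(-lam_x * t)
    def f(t):                      # density of Y
        return lam_y * math.exp(-lam_y * t)
    h = (upper - T) / steps
    total = 0.0
    for i in range(steps):
        t = T + (i + 0.5) * h
        total += (F(t) - F(T)) * f(t)
    return h * total

exact = closed_form(1.5, 0.7, 0.4)
approx = by_integration(1.5, 0.7, 0.4)
print(exact, approx)
```

The two values should agree up to the discretization and truncation error of the numerical integral.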
Lemma 3.3. For every set S i , we have
$$\Pr(S_i \text{ is a surprise set}) \le \frac{1}{k} \cdot \frac{\mu(S_1)}{\mu(S_i)}.$$
Proof. First, we show that min(e(S i ), h(S 1 )) ≤ h(S i \ S 1 ).
Claim 3.4. We always have min(e(S i ), h(S 1 )) ≤ h(S i \ S 1 ).
Proof. Consider an arbitrary realization of the game and the time t = h(S i \ S 1 ) when S i \ S 1 is hit. If by this time, S 1 has already been hit then h(S 1 ) < t. Similarly, if by this time, S i has already been eliminated then e(S i ) < t. Otherwise, both S 1 and S i are still remaining in the game at time t. Therefore, when we pick ω ∈ S i \ S 1 at time t, set S i gets eliminated (since ω ∈ S i ; ω / ∈ S 1 ; both S 1 and S i are remaining in the game). Thus, in this case, e(S i ) = t. This concludes the proof.
If S i is a surprise set, then min(e(S i ), h(S 1 )) = h(S 1 ) ≥ L/µ(S i ). By Claim 3.4, we have
h(S i \ S 1 ) ≥ min e(S i ), h(S 1 ) = h(S 1 ) ≥ L/µ(S i ).
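Thus, $\Pr(S_i \text{ is a surprise set}) \le \Pr\{h(S_i \setminus S_1) \ge h(S_1) \ge L/\mu(S_i)\}$. By Lemma 3.2 applied to the independent exponential random variables $X = h(S_1)$ and $Y = h(S_i \setminus S_1)$, and time $T = L/\mu(S_i)$, we have
$$\Pr(S_i \text{ is a surprise set}) \le \frac{\mu(S_1)}{\mu(S_i \setminus S_1) + \mu(S_1)} \cdot e^{-\frac{L(\mu(S_i \setminus S_1) + \mu(S_1))}{\mu(S_i)}} \le \frac{1}{k} \cdot \frac{\mu(S_1)}{\mu(S_i)}.$$
The last inequality holds because $\mu(S_i \setminus S_1) + \mu(S_1) = \mu(S_i \cup S_1) \ge \mu(S_i)$ and $e^{-L} = 1/k$.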

Section: General Case
Proof of Theorem 2.1. We upper bound the expected cost of the game for arbitrary sets S 1 , . . . , S k .
As before, we assume that $S_1$ is the smallest set. We remind the reader that each hitting time $h(S_i)$ is an exponential random variable with parameter $\mu(S_i)$. In the proof, we will use the definition of surprise sets (see Definition 3.1). We also set $L = \ln k$. We call every set $S_i$ with $i \ne 1$ that is not a surprise set a non-surprise set.
We separately upper bound the cost of the winner depending on whether the winner is (a) set S 1 , (b) surprise set, (c) non-surprise set. Write
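$$E[\mu(\mathrm{winner}(K))] = \underbrace{E[\mu(\mathrm{winner}(K)) \cdot 1\{\mathrm{winner}(K) = S_1\}]}_{(a)} + \underbrace{E[\mu(\mathrm{winner}(K)) \cdot 1\{\text{winner is a surprise set}\}]}_{(b)} + \underbrace{E[\mu(\mathrm{winner}(K)) \cdot 1\{\text{winner is a non-surprise set}\}]}_{(c)}.$$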
Term (a) is upper bounded by $\mu(S_1)$. We bound term (b) using Lemma 3.3: The probability that a set $S_i$ is a surprise set is at most $\frac{1}{k} \cdot \mu(S_1)/\mu(S_i)$. Thus, the expected total measure of all surprise sets (not only the surprise winner) is upper bounded by $\frac{1}{k} \sum_{i=2}^{k} \frac{\mu(S_1)}{\mu(S_i)} \mu(S_i) < \mu(S_1)$. We now bound term (c). Define a new random variable: Let $\mathrm{cost}(\omega)$ be the cost of the winner (i.e., $\mu(S_i)$, where $S_i$ is the winner) if (1) the winner is a non-surprise set, and (2) ω is the first element that was chosen in $S_1$. We let $\mathrm{cost}(\omega) = 0$, otherwise. If ω is the first element that was chosen in $S_1$, then $h(S_1) = h(\omega)$. So, the definition of $\mathrm{cost}(\omega)$ can be written as follows:
$$\mathrm{cost}(\omega) = \mu(\mathrm{winner}(K)) \cdot 1\{h(S_1) = h(\omega)\} \cdot 1\{\mathrm{winner}(K) \notin \mathrm{Surprise}\}.$$
Since the hitting time $h(S_1)$ is finite with probability 1, the term (c) equals
$$\sum_{\omega \in S_1} E[\mathrm{cost}(\omega)]. \tag{7}$$
If $S_i$ is a non-surprise set, then $h(S_1) < L/\mu(S_i)$ or $e(S_i) < h(S_1)$. If $S_i$ is the winner, then $e(S_i) \ge h(S_1)$. Thus, if $S_i$ is a non-surprise winner, then $h(S_1) < L/\mu(S_i)$. This observation gives us the following upper bound on each term of (7):
$$E[\mathrm{cost}(\omega)] \le \sum_{i=2}^{k} \mu(S_i) \cdot \Pr\{S_i = \mathrm{winner}(K) \text{ and } h(\omega) = h(S_1) < L/\mu(S_i)\}. \tag{8}$$
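Define two set systems $I^-_\omega$ and $I^+_\omega$ of sets $S_i$ not containing and containing ω, respectively: $I^-_\omega = \{S_i : \omega \notin S_i \text{ and } i \ge 2\}$; $I^+_\omega = \{S_i : \omega \in S_i \text{ and } i \ge 2\}$. Note that $K \equiv \{S_1, \dots, S_k\} = \{S_1\} \cup I^-_\omega \cup I^+_\omega$. By Lemma 2.5, $\mathrm{winner}(K) \in \{S_1, \mathrm{winner}(I^-_\omega), \mathrm{winner}(I^+_\omega)\}$. Observe that if $S_i$ with $i \ge 2$ is the winner, then $S_i = \mathrm{winner}(I^-_\omega)$ or $S_i = \mathrm{winner}(I^+_\omega)$. We replace the condition $S_i = \mathrm{winner}(K)$ with $S_i \in \{\mathrm{winner}(I^-_\omega), \mathrm{winner}(I^+_\omega)\}$ in (8) and get the bound:
$$E[\mathrm{cost}(\omega)] \le \sum_{i=2}^{k} \mu(S_i) \cdot \Pr\left\{S_i \in \{\mathrm{winner}(I^-_\omega), \mathrm{winner}(I^+_\omega)\} \text{ and } h(\omega) < \frac{L}{\mu(S_i)}\right\}.$$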
The key observation now is that the sets $\mathrm{winner}(I^-_\omega)$ and $\mathrm{winner}(I^+_\omega)$ are independent of $h(\omega)$. This is the case because the sets remaining in the competitions $R_n(I^-_\omega)$ and $R_n(I^+_\omega)$ do not change when we select ω.
Using that h(ω) is an exponential random variable with parameter µ(ω), we get (for every i)
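$$\mu(S_i) \cdot \Pr\left\{h(\omega) \le \frac{L}{\mu(S_i)}\right\} = \mu(S_i) \cdot \left(1 - e^{-L \mu(\omega)/\mu(S_i)}\right) \le \mu(S_i) \cdot \frac{L \mu(\omega)}{\mu(S_i)} = \mu(\omega) L.$$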
Hence,
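$$E[\mathrm{cost}(\omega)] \le \mu(\omega) L \cdot \sum_{i=2}^{k} \Pr\{S_i \in \{\mathrm{winner}(I^-_\omega), \mathrm{winner}(I^+_\omega)\}\}.$$
The sum on the right hand side is at most 2, since at most two sets $S_i$ can equal one of the two winners. Thus, $E[\mathrm{cost}(\omega)] \le 2L\mu(\omega)$. Summing over all $\omega \in S_1$, we bound term (c) by $2L\mu(S_1) = 2 \ln k \cdot \mu(S_1)$. Together with the bounds of $\mu(S_1)$ on each of the terms (a) and (b), this gives $E[\mu(\mathrm{winner}(K))] \le (2 \ln k + 2) \cdot \mu(S_1)$.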

Section: Acknowledgments and Disclosure of Funding
The authors are supported by NSF Awards CCF-1955351, CCF-1934931, EECS-2216970.

Section: 
Note that $e(S_i) \ge h(S_i)$. Sometimes, $e(S_i)$ may be equal to $h(S_i)$, but $e(S_i)$ and $h(S_i)$ are not always the same. We now prove that hitting times for disjoint sets are independent. To this end, we split the Poisson process $N_t(\Omega) = |\{n : \tau_n \le t\}|$. Let $N_t(\omega) = |\{n : \tau_n \le t \text{ and } \omega_n = \omega\}|$. It is easy to see that $N_t(\Omega) = \sum_{\omega \in \Omega} N_t(\omega)$ for every t. It is also true that each $N_t(\omega)$ is a Poisson process with parameter $\mu(\omega)$ and all $N_t(\omega)$ (for $\omega \in \Omega$) are mutually independent. This fact follows from the Coloring Theorem (see e.g., Kingman [1992], Coloring Theorem, page 53).
Theorem 2.8 (Coloring Theorem). Let $\Pi_t$ be a Poisson process on the real line with rate λ. We color each event of the Poisson process randomly with one of M colors: The probability that a point receives the i-th color is $p_i$. The colors of different points are independent. Let $\Pi_t(i)$ be the number of events of color i in the interval $(0, t]$. Then, $\Pi_t(1), \dots, \Pi_t(M)$ are independent Poisson processes. The rate of process $\Pi_t(i)$ is $\lambda p_i$.
Lemma 2.9. For every ω ∈ Ω, h(ω) is an exponential random variable with parameter µ(ω), and all random variables h(ω) (for ω ∈ Ω) are mutually independent.
Proof. Observe that h(ω) = min{t : N t (ω) ≥ 1}. Thus, h(ω) is an exponential random variable (the time of the first jump of a Poisson process) with rate µ(ω). Also, since all N t (ω) (for ω ∈ Ω) are mutually independent, all h(ω) are also mutually independent.
Note that the set elimination game depends only on the hitting times for elements ω in Ω. This is the case because it matters only when every ω is drawn for the first time. At that time, the hitting time of ω, all sets that contain ω are eliminated unless all remaining sets contain this ω. When the same ω is drawn again, it does not eliminate any new sets. Also, note that for any set $S \subset \Omega$, the hitting time $h(S) = \min_{\omega \in S} h(\omega)$. Thus, $h(S)$ is an exponential random variable with parameter $\mu(S) = \sum_{\omega \in S} \mu(\omega)$.
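The observation that only first hitting times matter suggests an equivalent way to simulate the discrete game: sample an independent exponential hitting time for every element and process the elements in increasing order of these times. The sketch below assumes a finite ground set with $\mu(\omega) > 0$ for all ω and distinct sets; the name `play_with_clock` and the instance are hypothetical.

```python
import random

def play_with_clock(sets, mu, rng):
    """Play the game using only the first hitting time of each element.

    Returns the set of remaining indices and the sampled hitting times.
    """
    # h(omega) is exponential with rate mu(omega), independently for each omega.
    hit_time = {w: rng.expovariate(m) for w, m in mu.items()}
    remaining = set(range(len(sets)))
    for omega in sorted(hit_time, key=hit_time.get):
        hit = {i for i in remaining if omega in sets[i]}
        if hit != remaining:        # some remaining set avoids omega
            remaining -= hit
    return remaining, hit_time

sets = [frozenset("a"), frozenset("bc"), frozenset("ac")]
mu = {"a": 1.0, "b": 2.0, "c": 3.0}
remaining, hit_time = play_with_clock(sets, mu, random.Random(0))
# Hitting time of a set is the minimum over its elements.
h_S = [min(hit_time[w] for w in S) for S in sets]
print(remaining, h_S)
```

After every element has been processed once, exactly one set remains, because any two distinct sets differ in some element and the set containing that element is eliminated when it is processed.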

Section: Proof of Main Result
We now present the proof of our main result, Theorem 2.1. We assume without loss of generality that S_1 is the smallest set, i.e., µ(S_1) ≤ µ(S_i) for all i. Then, the expected cost of the game is at most:
$$\mu(S_1) + \sum_{i=2}^{k} \Pr\{S_i = \mathrm{winner}(K)\} \cdot \mu(S_i). \tag{3}$$
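To make the quantity being bounded concrete, the following Python sketch simulates the set elimination game on a small finite ground set and estimates E[µ(winner(K))] by Monte Carlo. It is an illustration under stated assumptions, not part of the proof: the function names `play` and `estimate_expected_cost`, the toy instance, and the truncation of each realization to a fixed number of draws are hypothetical choices made here. The sketch also assumes that each ω_n equals ω with probability µ(ω)/µ(Ω), independently across rounds; the arrival times τ_n are not simulated because they do not change the order of eliminations.

```python
import math
import random


def play(sets, family, draws):
    """Play the game on the sets indexed by `family` for a fixed draw sequence.

    Returns the index of the single surviving set, or None if several sets
    are still alive when the draws run out.
    """
    alive = set(family)
    for w in draws:
        if len(alive) == 1:
            break
        hit = {i for i in alive if w in sets[i]}
        # Sets containing w are eliminated unless every remaining set contains w.
        if hit != alive:
            alive -= hit
    return next(iter(alive)) if len(alive) == 1 else None


def estimate_expected_cost(sets, mu, trials=2000, horizon=200, seed=0):
    """Monte Carlo estimate of E[mu(winner(K))] over `trials` realizations."""
    rng = random.Random(seed)
    elements = sorted(mu)
    weights = [mu[w] for w in elements]
    measure = [sum(mu[w] for w in s) for s in sets]
    total, decided = 0.0, 0
    for _ in range(trials):
        draws = rng.choices(elements, weights=weights, k=horizon)
        winner = play(sets, range(len(sets)), draws)
        if winner is not None:  # skip the (rare) undecided realizations
            total += measure[winner]
            decided += 1
    return total / decided, measure


if __name__ == "__main__":
    # Hypothetical toy instance: six elements of unit measure, three sets.
    sets = [frozenset({0, 1}), frozenset({1, 2, 3}), frozenset({2, 3, 4, 5})]
    mu = {w: 1.0 for w in range(6)}
    avg, measure = estimate_expected_cost(sets, mu)
    bound = (2 * math.log(len(sets)) + 2) * min(measure)
    print(f"estimated cost {avg:.3f}, bound {bound:.3f}")
```

Calling `play` with a sub-family of indices, on the same draw sequence, gives the winner of that sub-family, which is convenient for experimenting with statements such as Lemma 2.5.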
We first provide some intuition for the proof by considering the case when S_1 does not intersect the sets S_2, ..., S_k, i.e., S_1 and S_i are disjoint for all i = 2, 3, ..., k. We split all sets into two groups: S_1 and the rest of the sets, S_2, ..., S_k. We know from Lemma 2.5 that the winner among all sets S_1, ..., S_k is either S_1 or winner({S_2, ..., S_k}). Denote I^- = {S_2, ..., S_k}. Each set S_i is eliminated at time e(S_i). The set S_1 is eliminated at its hitting time h(S_1) unless it is the only remaining set at time h(S_1) (because we are considering the case when S_1 does not overlap with other sets). Thus,
$$\mathrm{winner}(K) = \begin{cases} S_1, & \text{if } h(S_1) > e(\mathrm{winner}(I^-)); \\ \mathrm{winner}(I^-), & \text{if } e(\mathrm{winner}(I^-)) > h(S_1). \end{cases} \tag{4}$$
When the winner among S_1, ..., S_k is not S_1, we consider two cases for the winner S_i: (1) S_i is a surprise set; (2) S_i is a non-surprise set.
Definition 3.1. We say that S_i is a surprise set if e(S_i) ≥ h(S_1) ≥ L/µ(S_i), where L = ln k.
We call S_i a surprise set because the probability of the event e(S_i) ≥ h(S_1) ≥ L/µ(S_i) is small. We give a bound on the probability of this event in Lemma 3.3. Here, we provide some intuition. By Lemma 2.9, the hitting time h(S_i) is an exponential random variable with parameter µ(S_i). Thus, the expected hitting time for S_i is 1/µ(S_i). Consider a set S_i with a small measure (µ(S_i) is close to µ(S_1)). If the hitting time h(S_1) ≥ L/µ(S_i), then h(S_1) is much larger than its expected value 1/µ(S_1), which happens with a small probability. Consider a set S_i with a large measure µ(S_i) ≫ µ(S_1). Then, the expected hitting time for S_i is 1/µ(S_i), which is much smaller than the expected hitting time of S_1. Thus, the event e(S_i) ≥ h(S_1) occurs with a small probability.
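In the disjoint case, this intuition can be made quantitative in a few lines. The sketch below assumes S_1 ∩ S_i = ∅ and is meant as an illustration rather than a replacement for Lemma 3.3. If an element of S_i is drawn before time h(S_1), then S_1 is still in the game and does not contain that element, so S_i is eliminated at that moment; hence e(S_i) ≥ h(S_1) forces h(S_i) ≥ h(S_1). Applying identity (6), Pr{Y ≥ X ≥ T} = λ_X/(λ_X + λ_Y) · e^{-(λ_X+λ_Y)T}, to the independent exponential variables X = h(S_1) and Y = h(S_i) with T = L/µ(S_i) gives:

```latex
\begin{align*}
\Pr\{S_i \text{ is a surprise set}\}
  &\le \Pr\{h(S_i) \ge h(S_1) \ge L/\mu(S_i)\} \\
  &= \frac{\mu(S_1)}{\mu(S_1) + \mu(S_i)} \cdot
     e^{-L(\mu(S_1) + \mu(S_i))/\mu(S_i)} \\
  &\le \frac{\mu(S_1)}{\mu(S_i)} \cdot e^{-L}
   = \frac{1}{k} \cdot \frac{\mu(S_1)}{\mu(S_i)}.
\end{align*}
```

The last step uses L = ln k. Both regimes of the intuition are visible here: for µ(S_i) close to µ(S_1) the factor e^{-L} does the work, while for µ(S_i) ≫ µ(S_1) the ratio µ(S_1)/µ(S_i) is already small.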
Section: Appendix A: Proof of Lemma 2.5
Lemma 2.5. Consider a partitioning of the set system K = {S_1, ..., S_k} into m sets I_1, ..., I_m. Then, winner(K) ∈ {winner(I_1), ..., winner(I_m)}.
The proof of Lemma 2.5 relies on the following observation.
Lemma A.1. Let X and Y be two subsets of K. If X ⊂ Y, then for every n, we always have
$$R_n(Y) \cap X = R_n(X) \quad \text{or} \quad R_n(Y) \cap X = \varnothing. \tag{9}$$
Proof. We prove that (9) holds by induction on n. Initially, when n = 0, we have R_0(Y) ∩ X = Y ∩ X = X = R_0(X). Suppose (9) holds for n; we prove that (9) also holds for n + 1. If R_n(Y) ∩ X = ∅, then R_{n′}(Y) ∩ X remains empty for all n′ ≥ n. Therefore, (9) holds for n + 1. So, let us assume that R_n(Y) ∩ X = R_n(X). Consider three cases:
• If ω_{n+1} belongs to all sets in R_n(Y), then it also belongs to all sets in R_n(X) = R_n(Y) ∩ X. Thus, in this case, no set is eliminated in either game: R_{n+1}(X) = R_n(X) and R_{n+1}(Y) = R_n(Y).
• If ω_{n+1} belongs to all sets in R_n(X), but not to all sets in R_n(Y), then, at step n + 1, we remove all sets that contain ω_{n+1} and, in particular, all sets in R_n(X), from R_n(Y). Consequently, R_{n+1}(Y) ∩ X = ∅.
• If not all sets in R_n(X) and not all sets in R_n(Y) contain ω_{n+1}, then we remove exactly the same sets from both R_n(X) and R_n(Y) ∩ X. Namely, we remove the sets S ∈ R_n(X) that contain ω_{n+1}.
We conclude that (9) holds for n ′ = n + 1.
Proof of Lemma 2.5. Consider an arbitrary realization of the game ω_1, ω_2, .... Let n be the round when all sets but the winner are eliminated from the competition, i.e., R_n contains only one set, the winner. Since K is the union of I_1, ..., I_m, the winner must belong to some I_j. Now, by Lemma A.1 for X = I_j and Y = K, we have R_n(K) ∩ I_j = R_n(I_j) or R_n(K) ∩ I_j = ∅. We know that R_n(K) = {winner(K)} and winner(K) ∈ I_j. Thus, R_n(K) ∩ I_j = {winner(K)} ≠ ∅, and R_n(I_j) = R_n(K) ∩ I_j = {winner(K)}.
We conclude that at round n, R_n(I_j) contains only one set, the winner in K. Consequently, it is also the winner in I_j, i.e., winner(I_j) = winner(K). This finishes the proof.
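The invariant of Lemma A.1 is easy to test empirically. The Python sketch below records the remaining sets round by round for two nested families and checks that R_n(Y) ∩ X is either R_n(X) or empty in every round. The helper names `remaining_sets` and `check_lemma_a1` and the toy instance are hypothetical, and the sketch assumes that the game on a family I starts from R_0(I) = I; it is a sanity check on sample realizations, not a proof.

```python
import random


def remaining_sets(sets, family, draws):
    """Return [R_0, R_1, ..., R_N] for the game restricted to `family`."""
    alive = frozenset(family)
    history = [alive]
    for w in draws:
        hit = frozenset(i for i in alive if w in sets[i])
        # Eliminate the sets containing w unless every remaining set contains w.
        if hit != alive:
            alive = alive - hit
        history.append(alive)
    return history


def check_lemma_a1(sets, X, Y, draws):
    """Check that R_n(Y) & X is either R_n(X) or empty in every round n."""
    X = frozenset(X)
    rx = remaining_sets(sets, X, draws)
    ry = remaining_sets(sets, Y, draws)
    return all((b & X) in (a, frozenset()) for a, b in zip(rx, ry))


if __name__ == "__main__":
    # Hypothetical toy instance over the ground set {0, ..., 5}.
    sets = [frozenset({0, 1}), frozenset({1, 2, 3}), frozenset({2, 3, 4, 5})]
    rng = random.Random(1)
    ok = all(
        check_lemma_a1(sets, [0, 1], [0, 1, 2],
                       [rng.randrange(6) for _ in range(10)])
        for _ in range(1000)
    )
    print("invariant held on all sampled realizations:", ok)
```

Both families are driven by the same draw sequence, which mirrors the coupling used in the proof: the lemma is a statement about a single realization ω_1, ω_2, ..., not about distributions.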


References:
[b0] Sayan Bandyapadhyay; Fedor Fomin; Petr A Golovach; William Lochet; Nidhi Purohit; Kirill Simonov (2022). How to find a good explanation for clustering. 
[b1] Jarosław Byrka; Thomas Pensyl; Bartosz Rybicki; Aravind Srinivasan; Khoa Trinh (2017). An improved approximation for k-median and positive correlation in budgeted optimization. ACM Transactions on Algorithms (TALG)
[b2] Moses Charikar; Lunjia Hu (2022). Near-optimal explainable k-means for all dimensions. SIAM
[b3] Moses Charikar; Sudipto Guha; Éva Tardos; David B Shmoys (1999). A constant-factor approximation algorithm for the k-median problem. 
[b4] Vincent Cohen-Addad; Euiwoong Lee (2022). Johnson coverage hypothesis: Inapproximability of k-means and k-median in ℓp-metrics. SIAM
[b5] Vincent Cohen-Addad; Hossein Esfandiari; Vahab Mirrokni; Shyam Narayanan (2022). Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets. 
[b6] Sanjoy Dasgupta; Nave Frost; Michal Moshkovitz; Cyrus Rashtchian (2020). Explainable k-means and k-medians clustering. 
[b7] Hossein Esfandiari; Vahab Mirrokni; Shyam Narayanan (2022). Almost tight approximation algorithms for explainable clustering. SIAM
[b8] Nave Frost; Michal Moshkovitz; Cyrus Rashtchian (2020). Exkmc: Expanding explainable k-means clustering. 
[b9] Buddhima Gamlath; Xinrui Jia; Adam Polak; Ola Svensson (2021). Nearly-tight and oblivious algorithms for explainable clustering. Advances in Neural Information Processing Systems
[b10] Anupam Gupta; Madhusudhan Reddy Pittu; Ola Svensson; Rachel Yuan (2023). The price of explainability for clustering. 
[b11] John Frank Charles Kingman (1992). Poisson processes. Clarendon Press
[b12] Eduardo Laber; Lucas Murtinho; Felipe Oliveira (2023). Shallow decision trees for explainable k-means clustering. Pattern Recognition
[b13] Eduardo S Laber; Lucas Murtinho (2021). On the price of explainability for some clustering problems. PMLR
[b14] Shi Li; Ola Svensson (2013). Approximating k-median via pseudo-approximation. 
[b15] Konstantin Makarychev; Liren Shan (2021). Near-optimal algorithms for explainable k-medians and k-means. PMLR
[b16] Konstantin Makarychev; Liren Shan (2022). Explainable k-means: don't be greedy, plant bigger trees. 
[b17] Nimrod Megiddo; Kenneth J Supowit (1984). On the complexity of some common geometric location problems. SIAM journal on computing

Figures:
Figure fig_0: 1
Type: figure
Caption: Figure 1: The unconstrained k-medians clustering and explainable k-medians clustering. The left diagram shows the Voronoi partition of the plane w.r.t. three centers in ℓ 1 distance. The Voronoi cell for each center consists of all points that are closer (in ℓ 1 distance) to this center than to any other center (the boundaries between cells are not straight lines because we use the ℓ 1 distance). The middle diagram shows an explainable partition. The right diagram shows the corresponding decision tree for explainable clustering.
Data: 

Figure fig_1: 
Type: figure
Caption: which we prove below, gives a bound of 2Lµ(S_1) on the expression above. Combining the upper bounds on terms (a), (b), and (c), we get E[µ(winner(K))] ≤ (1 + 2L + 1) · µ(S_1) = (2 ln k + 2) · µ(S_1). Lemma 3.5. For every ω ∈ S_1, we have E[cost(ω)] ≤ 2Lµ(ω).
Data: 

Figure fig_2: 
Type: figure
Caption: Proof. We have E[cost(ω)] = E[µ(winner(K)) · 1{h(S_1) = h(ω)} · 1{winner(K) ∉ Surprise}]. (7)
Data: 

Figure fig_3: 
Type: figure
Caption: The set R_n(I^-_ω) does not change in the round n when ω is chosen because no set S_i in R_n(I^-_ω) ⊂ I^-_ω contains ω. The set R_n(I^+_ω) does not change in this round because all sets S_i in R_n(I^+_ω) ⊂ I^+_ω contain ω and, consequently, when ω is chosen, none of these sets is removed from R_n(I^+_ω) (otherwise, R_n(I^+_ω) would become empty). Thus, E[cost(ω)] ≤ ∑_{i=2}^{k} µ(S_i) · Pr{S_i ∈ {winner(I^-_ω), winner(I^+_ω)}} · Pr{h(ω) < L/µ(S_i)}.
Data: 


Formulas:
Formula formula_0: c 1 , . . . , c k , E[cost(X, T )] ≤ O(log k log log k) • Input: a data set X ⊂ R d and set of centers C = {c 1 , c 2 , . . . , c k } ⊂ R d Output: a threshold tree T Create tree T 0 containing a root node r. Assign C r = {c 1 , c 2 , • • • , c k } to the root. Let t = 0. Let M = max ij |c i j |.

Formula formula_1: Left = {c ∈ C u : c j ≤ θ} and Right = {c ∈ C u : c j > θ}.

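Formula formula_2: $R_n = \begin{cases} R_{n-1} \setminus \{S_i \in R_{n-1} : \omega_n \in S_i\}, & \text{if } \omega_n \notin S_i \text{ for some } S_i \in R_{n-1}; \\ R_{n-1}, & \text{otherwise.} \end{cases}$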

Formula formula_3: $\mathbb{E}[\mu(\mathrm{winner})] \le (2\ln k + 2) \cdot \min_{i \in [k]} \mu(S_i)$

Formula formula_4: $\mathbb{E}[\mathrm{cost}(X, T)] \le (2\ln k + 2) \cdot \sum_{x \in X} \min_{c \in \{c^1, \dots, c^k\}} \|x - c\|_1.$

Formula formula_5: $\mathrm{cost}(X, T) \equiv \min_{\hat{c}^1, \dots, \hat{c}^k} \sum_{i=1}^{k} \sum_{x \in P_i} \|x - \hat{c}^i\|_1 \le \sum_{i=1}^{k} \sum_{x \in P_i} \|x - c^i\|_1.$

Formula formula_6: Ω = {(j, θ) : j ∈ [d], θ ∈ [-M, M ]}.

Formula formula_7: $\mu(S) = \sum_{j=1}^{d} \mu_L(\{\theta : (j, \theta) \in S\}),$

Formula formula_8: $S_i = \{(j, \theta) \in \Omega : \mathrm{sign}(x_j - \theta) \ne \mathrm{sign}(c^i_j - \theta)\}.$

Formula formula_9: E[cost(x, T )] = E[µ(winner)] ≤ (2 ln k + 2) • min i µ(S i ) = (2 ln k + 2) • min i ∥x -c i ∥ 1 .

Formula formula_10: $R_n(I) = \begin{cases} R_{n-1}(I) \setminus \{S \in R_{n-1}(I) : \omega_n \in S\}, & \text{if } \omega_n \notin S' \text{ for some } S' \in R_{n-1}(I); \\ R_{n-1}(I), & \text{otherwise.} \end{cases}$

Formula formula_11: τ n = τ n-1 + ∆τ n = ∆τ 1 + • • • + ∆τ n .

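Formula formula_12: $\sum_{i=2}^{k} \Pr\{S_i = \mathrm{winner}(K)\} \cdot \mu(S_i) \le \sum_{i=2}^{k} \Pr\{S_i = \mathrm{winner}(K),\ S_i \notin \mathrm{Surprise}\} \cdot \mu(S_i) + \sum_{i=2}^{k} \Pr\{S_i \in \mathrm{Surprise}\} \cdot \mu(S_i). \tag{5}$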

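Formula formula_13: $\Pr\{S_i = \mathrm{winner}(K),\ S_i \notin \mathrm{Surprise}\} \le \Pr\{h(S_1) \le L/\mu(S_i)\} \cdot \Pr\{S_i = \mathrm{winner}(I^-)\} = \left(1 - e^{-L\mu(S_1)/\mu(S_i)}\right) \cdot \Pr\{S_i = \mathrm{winner}(I^-)\}$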

Formula formula_14: $\le \Pr\{S_i = \mathrm{winner}(I^-)\} \cdot L \cdot \mu(S_1)/\mu(S_i).$

Formula formula_15: $\mu(S_1) + \sum_{i=2}^{k} \Pr\{S_i = \mathrm{winner}(I^-)\} \cdot L \cdot \mu(S_1) + \mu(S_1) = (L + 2) \cdot \mu(S_1) = (\ln k + 2) \cdot \mu(S_1).$

Formula formula_16: $\Pr\{Y \ge X \ge T\} = \frac{\lambda_X}{\lambda_X + \lambda_Y} \cdot e^{-(\lambda_X + \lambda_Y) T}. \tag{6}$

Formula formula_17: $\int_T^{\infty} (F_X(t) - F_X(T)) f_Y(t)\,dt$, where $F_X(t) = 1 - e^{-\lambda_X t}$ is the cumulative distribution function of $X$, and $f_Y(t) = \lambda_Y \cdot e^{-\lambda_Y t}$

Formula formula_18: $\Pr\{Y \ge X \ge T\} = \Pr\{Y \ge X \text{ and } \min(X, Y) \ge T\} = \Pr\{X \le Y \mid \min(X, Y) \ge T\} \cdot \Pr\{\min(X, Y) \ge T\}.$

Formula formula_19: X +λ Y . Then, Pr X ≤ Y | min(X, Y ) ≥ T ) = Pr X ≤ Y ) because the exponential distribution is memoryless; and Pr X ≤ Y ) = λ X /(λ X + λ Y ).

Formula formula_20: $\Pr(S_i \text{ is a surprise set}) \le \frac{1}{k} \cdot \frac{\mu(S_1)}{\mu(S_i)}.$

Formula formula_21: $h(S_i \setminus S_1) \ge \min\{e(S_i), h(S_1)\} = h(S_1) \ge L/\mu(S_i).$

Formula formula_22: $\Pr(S_i \text{ is a surprise set}) \le \frac{\mu(S_1)}{\mu(S_i \setminus S_1) + \mu(S_1)} \cdot e^{-\frac{L(\mu(S_i \setminus S_1) + \mu(S_1))}{\mu(S_i)}} \le \frac{1}{k} \cdot \frac{\mu(S_1)}{\mu(S_i)}.$

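Formula formula_23: $\mathbb{E}[\mu(\mathrm{winner}(K))] = \underbrace{\mathbb{E}[\mu(\mathrm{winner}(K)) \cdot \mathbf{1}\{\mathrm{winner}(K) = S_1\}]}_{(a)} + \underbrace{\mathbb{E}[\mu(\mathrm{winner}(K)) \cdot \mathbf{1}\{\text{winner is a surprise set}\}]}_{(b)} + \underbrace{\mathbb{E}[\mu(\mathrm{winner}(K)) \cdot \mathbf{1}\{\text{winner is a non-surprise set}\}]}_{(c)}.$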

Formula formula_24: $\mathrm{cost}(\omega) = \mu(\mathrm{winner}(K)) \cdot \mathbf{1}\{h(S_1) = h(\omega)\} \cdot \mathbf{1}\{\mathrm{winner}(K) \notin \mathrm{Surprise}\}.$

Formula formula_25: E cost(ω) ≤ k i=2

Formula formula_27: E cost(ω) ≤ k i=2

Formula formula_28: $\mu(S_i) \cdot \Pr\{h(\omega) \le L/\mu(S_i)\} = \mu(S_i) \cdot \left(1 - e^{-L\mu(\omega)/\mu(S_i)}\right) \le \mu(S_i) \cdot \frac{L\mu(\omega)}{\mu(S_i)} = \mu(\omega) L.$

Formula formula_29: E cost(ω) ≤ µ(ω)L • k i=2

