Abstract: Clustering is a fundamental unsupervised learning problem with many applications in data mining, image classification, and other fields. Although k-means and k-median clustering algorithms are widely used because they are simple and effective on point-based data, they often perform poorly on structured data such as lines, graphs, and time series, where point-based representations fail to capture the underlying structure. Moreover, existing algorithms for structured data typically do not account for fairness constraints, which are increasingly important in modern applications involving sensitive attributes such as gender, race, or user groups. In this paper, we formally introduce the group fair k-median of lines problem (Gf-k-Ml), a new variant of k-median that integrates fairness constraints into the clustering of structured data represented by lines. Given a set L of n lines in \(\mathbb{R}^d\), partitioned into t disjoint color groups \(L_1, \ldots, L_t\), the goal of the Gf-k-Ml problem is to partition L into k clusters such that the proportion of lines from each color group in each cluster remains within a specified range and the sum of the distances from each line to its assigned center is minimized. We introduce a group-wise coreset construction algorithm that partitions the input lines according to their sensitive attributes and computes a separate coreset for each resulting group, and we prove that these group-wise coresets satisfy composability under fairness constraints. Our main result is a coreset that satisfies the fairness constraint and has size \(O\left(\frac{td^2k\log^2k\log^2n}{\varepsilon^2}\right)\) for an error parameter \(\varepsilon \in (0,1)\), and that can be constructed in time nearly linear in n.
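To make the problem statement concrete, the following is a minimal Python sketch of how a candidate Gf-k-Ml solution could be evaluated. It is an illustration only, not the paper's algorithm and not the coreset construction. It assumes that each line is given as an anchor point plus a direction vector, that centers are points in \(\mathbb{R}^d\), and that the "specified range" is given as per-color lower and upper proportions. The function names and the `lower`/`upper` parameters are hypothetical and do not come from the paper.

```python
import math

def point_line_distance(center, anchor, direction):
    """Euclidean distance from the point `center` to the line anchor + s * direction."""
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    offset = [c - a for c, a in zip(center, anchor)]
    # Drop the component of the offset that runs along the line.
    along = sum(o * u for o, u in zip(offset, unit))
    residual = [o - along * u for o, u in zip(offset, unit)]
    return math.sqrt(sum(r * r for r in residual))

def clustering_cost(lines, centers, assignment):
    """k-median cost: sum over lines of the distance to the assigned center."""
    return sum(
        point_line_distance(centers[assignment[i]], anchor, direction)
        for i, (anchor, direction) in enumerate(lines)
    )

def is_group_fair(colors, assignment, k, lower, upper):
    """Check that each color's share of every non-empty cluster lies in [lower, upper].

    `lower` and `upper` map each color to its allowed proportion bounds
    (an assumed encoding of the fairness range).
    """
    for j in range(k):
        members = [colors[i] for i in range(len(colors)) if assignment[i] == j]
        if not members:
            continue
        for color in lower:
            share = members.count(color) / len(members)
            if not (lower[color] <= share <= upper[color]):
                return False
    return True

# Four lines in the plane: two horizontal (y = 0, y = 2), two vertical (x = 5, x = 7).
lines = [((0, 0), (1, 0)), ((0, 2), (1, 0)), ((5, 0), (0, 1)), ((7, 0), (0, 1))]
colors = ["a", "b", "a", "b"]
centers = [(0, 1), (6, 0)]
lower = {"a": 0.4, "b": 0.4}
upper = {"a": 0.6, "b": 0.6}

balanced = [0, 0, 1, 1]   # each cluster holds one line of each color
skewed = [0, 1, 0, 1]     # each cluster holds lines of a single color

cost = clustering_cost(lines, centers, balanced)
fair_balanced = is_group_fair(colors, balanced, 2, lower, upper)
fair_skewed = is_group_fair(colors, skewed, 2, lower, upper)
```

In this toy instance the balanced assignment satisfies the proportion bounds while the skewed one does not, which shows how the fairness range restricts the set of feasible partitions independently of the distance cost.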