Keywords: local search,line clustering,k-means of lines
Abstract: Clustering is a fundamental problem that has been extensively studied over past few decades, with most research focusing on point-based clustering such as $k$-means, $k$-median, and $k$-center.
However, numerous real-world applications, such as motion analysis, computer vision, and missing data analysis, require clustering over structured data, including lines, time series and affine subspaces (flats), where traditional point-based clustering algorithms often fall short.
In this paper, we study the $k$-means of lines problem, where the input is a set $L$ of lines in $\mathbb{R}^d$, and the goal is to find $k$ centers $C$ in $\mathbb{R}^d$ such that the sum of squared distances from each line in $L$ to its nearest center in $C$ is minimized. The local search algorithm is a well-established strategy for point-based $k$-means clustering, known for its efficiency and provable approximation guarantees. However, extending local search algorithm  to the $k$-means of lines problem is nontrivial, as the capture relation used in point-based clustering does not generalize to the line setting. This is because that the point-to-line distance function lack the triangle inequality property that supports geometric analysis in point-based clustering. Moreover, since lines extend infinitely in space, it is  difficult to identify effective swap points that can significantly reduce the clustering cost. To overcome above obstacles, we introduce a *proportional capture relation* that links optimal and current centers based the assignment proportions of lines, enabling a refined analysis that bypasses the triangle inequality barrier. We also introduce a *CrossLine* structure, which provides a principled discretization of the geometric space around line pairs, and ensures coverage of high-quality swap points essential for local search, thereby enabling effective execution of the local search process. Consequently, based on the proposed components, we develop the first single-swap local search algorithm for the $k$-means of lines problem, achieving a $(500+\varepsilon)$-approximation in polynomial time for low-dimensional Euclidean space.
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 25128
Loading