Abstract: The \emph{$k$-means for lines} is a set of $k$ centers (points) that minimizes the sum of squared distances to a given set of $n$ lines in $\REAL^d$. This is a straightforward generalization of the $k$-means problem, where the input is a set of $n$ points. Related problems minimize the sum of (non-squared) distances, use other norms or $m$-estimators, or ignore the $t$ farthest lines (outliers) from the $k$ centers. We suggest the first algorithms that take an error parameter $\varepsilon \in (0, 1)$ and compute a $(1+\varepsilon)$-approximation to these problems in time near-linear in $n$ for every constant $k \geq 1$, including support for streaming and distributed input. This is done by proving that there is a subset, called a \emph{core-set}, of $O(d\log^2 n/\varepsilon^2)$ weighted lines that approximates the sum of these distances for any given set of $k$ centers, and by giving an efficient construction of it. Experimental results on the Amazon EC2 cloud and open-source code are also provided.
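To make the objective in the abstract concrete, the following is a minimal sketch of the cost being minimized: the (optionally weighted) sum, over the input lines, of the squared distance from each line to its nearest center. It is not taken from the linked repository. The function names `dist_sq_point_to_line` and `k_lines_means_cost`, and the representation of a line as a pair (point on the line, direction vector), are assumptions made for illustration only. The `weights` argument mirrors how a weighted core-set would be evaluated, but the sketch does not construct a core-set.

```python
import math

def dist_sq_point_to_line(center, point_on_line, direction):
    """Squared Euclidean distance from `center` to the line {p + t*u : t real}.

    The line representation (point, direction) is an assumption of this sketch.
    """
    norm = math.sqrt(sum(u * u for u in direction))
    unit = [u / norm for u in direction]
    diff = [c - p for c, p in zip(center, point_on_line)]
    # Length of the component of `diff` along the line.
    proj = sum(a * b for a, b in zip(diff, unit))
    # Pythagoras: squared distance = |diff|^2 - proj^2 (clamped against rounding).
    return max(0.0, sum(a * a for a in diff) - proj * proj)

def k_lines_means_cost(lines, centers, weights=None):
    """Weighted sum of squared distances from each line to its nearest center."""
    if weights is None:
        weights = [1.0] * len(lines)
    return sum(
        w * min(dist_sq_point_to_line(c, p, u) for c in centers)
        for w, (p, u) in zip(weights, lines)
    )

# Toy input in the plane: the x-axis, the y-axis, and the line y = 5.
lines = [
    ((0.0, 0.0), (1.0, 0.0)),
    ((0.0, 0.0), (0.0, 1.0)),
    ((0.0, 5.0), (1.0, 0.0)),
]
centers = [(3.0, 1.0), (2.0, 5.0)]  # k = 2 candidate centers
cost = k_lines_means_cost(lines, centers)
print(cost)  # 1 (x-axis) + 4 (y-axis) + 0 (line y = 5) = 5.0
```

A core-set in the sense of the abstract would be a small weighted subset of `lines` for which this cost stays within a $(1 \pm \varepsilon)$ factor of the full cost for every choice of `centers`.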
Code Link: https://github.com/YairMarom/k_lines_means
CMT Num: 6975