Abstract: n the Subspace Clustering with Missing Data (SCMD) problem, we are given a collection of n
data points, arranged into columns of a matrix X ∈ Rd×n, and each of the data points is observed
only on a subset of its coordinates. The data points are assumed to be concentrated near a union
of low-dimensional subspaces. The goal of SCMD is to cluster the vectors of the data matrix X
as per their subspace membership. State-of-the-art algorithms for SCMD can perform poorly on
instances with a large amount of missing data or if the data matrix X is nearly full-rank. We propose
a novel integer programming-based method for SCMD. The approach is based on dynamically
determining a set of candidate subspaces and optimally assigning points to selected subspaces. The
problem structure is identical to the classical facility-location problem, with subspaces playing
the role of facilities and data points that of customers. We propose a column-generation approach
for identifying candidate subspaces combined with a Benders decomposition approach for solving
the linear programming relaxation of the formulation. An empirical study demonstrates that the
proposed approach can achieve better clustering accuracy than state-of-the-art methods when the
data is high-rank or the percentage of missing data is high
Loading