Abstract: Matrix completion is a classical problem that has received recurring interest from a wide range of areas. In this paper, we revisit this problem in an ultra-sparse sampling regime, where each entry of an unknown $n$ by $d$ matrix $M$ ($n \ge d$) is observed independently with probability $p = \frac C d$, for a fixed constant $C \ge 2$. This setting is motivated by applications dealing with large, sparse panel datasets, where the number of rows (e.g., users) is much larger than the number of columns or items. With only $C$ observed entries per row, which can be fewer than the rank of $M$, accurately imputing the missing entries of $M$ is not possible. We instead consider estimating the row span of $M$, or equivalently the averaged second-moment matrix $T = \frac 1 n M^{\top} M$.
The empirical second-moment matrix constructed from the observed data suffers from sparse and non-random missingness. We design an unbiased estimator that normalizes each nonzero entry of the second moment by its observed frequency, followed by gradient-based imputation of the missing entries. The normalization divides a weighted sum of $n$ Bernoulli random variables by their total number of ones (a binomial count), which is a nonlinear operation. We show that the estimator is unbiased for any sampling probability $p$ and incurs low variance. Assuming the row vectors of $M$ are drawn uniformly from a rank-$r$ factor model, we prove that provided the number of rows $n$ exceeds $O(\frac{d r^5 \log d}{C^2\epsilon^2})$, our algorithm recovers $T$ with Frobenius norm error less than $\epsilon^2$, under an incoherence condition on the factor model.
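As a rough illustration of the normalization step, here is a minimal NumPy sketch; the function name, the binary mask convention, and leaving never-co-observed entries as NaN for a later imputation step are our assumptions, not the paper's exact estimator:

```python
import numpy as np

def estimate_second_moment(X, mask):
    """Frequency-normalized estimate of T = (1/n) M^T M from partial observations.

    X    : (n, d) array holding observed values (zeros where unobserved)
    mask : (n, d) boolean array, True where an entry of M is observed
    """
    A = mask.astype(float)
    Xo = X * A
    S = Xo.T @ Xo              # sum_i X_ij * X_ik over rows co-observing (j, k)
    counts = A.T @ A           # number of rows where columns j and k are both observed
    T_hat = np.full(S.shape, np.nan)
    seen = counts > 0
    T_hat[seen] = S[seen] / counts[seen]   # normalize each entry by its observed frequency
    # Entries with counts == 0 remain NaN; per the abstract, such missing entries
    # of the second moment are filled in by a gradient-based imputation step.
    return T_hat
```

Dividing each entrywise sum by its random co-observation count is exactly the nonlinear normalization referred to above: the denominator is itself a binomial random variable.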
Experiments on both synthetic and real-world data are provided to evaluate our algorithms. When tested on three MovieLens datasets, our algorithm reduces bias by $88\%$ relative to alternative estimators. We also empirically validate, on synthetic data, that the required number of rows $n$ grows linearly in $d$. Finally, we apply the recovered row span to impute the missing entries of $M$. On an Amazon reviews dataset with sparsity $10^{-7}$, our algorithm reduces the recovery error of $T$ by $59\%$ and that of $M$ by $38\%$ compared to existing matrix completion methods.
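For concreteness, one standard way to use a recovered row span for imputation is to project each row onto the span, fitting coefficients on that row's observed entries. The eigendecomposition-based basis, the ridge regularization, and all names below are illustrative assumptions, not necessarily the paper's exact procedure:

```python
import numpy as np

def impute_rows(X, mask, T_hat, r, lam=1e-3):
    """Impute missing entries by projecting rows onto the recovered row span.

    T_hat is a completed estimate of T = (1/n) M^T M; its top-r eigenvectors
    approximately span the row space of M.
    """
    w, V = np.linalg.eigh(T_hat)
    V = V[:, np.argsort(w)[::-1][:r]]   # (d, r) orthonormal basis for the row span
    M_hat = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        obs = mask[i]
        Vo = V[obs]                     # basis restricted to this row's observed columns
        # Ridge-regularized least squares: with only C < r observations per row
        # the fit is underdetermined, so regularization is essential.
        c = np.linalg.solve(Vo.T @ Vo + lam * np.eye(r), Vo.T @ X[i, obs])
        M_hat[i] = V @ c
    return M_hat
```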
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
- Revised the paper to address the reviewers' comments, including remarks on the rank-$r$ factor models and potential extensions.
- Improved the exposition in the abstract and introduction, and added several related works to discuss alternative algorithms besides gradient descent for the imputation step.
- Fixed typos and other issues suggested by the reviewers.
Assigned Action Editor: ~Ruoyu_Sun1
Submission Number: 5113