Abstract: Matrix completion is a classical problem that has received recurring interest from a wide range of fields. In this paper, we revisit this problem in an ultra-sparse sampling regime, where each entry of an unknown $n \times d$ matrix $M$ (with $n \ge d$) is observed independently with probability $p = C / d$, for a fixed constant $C \ge 2$. This setting is motivated by applications involving large, sparse panel datasets, where the number of rows (users) far exceeds the number of columns (items). When each row contains only $C$ observed entries on average, fewer than the rank of $M$, accurate imputation of $M$ is impossible. Instead, we focus on estimating the \emph{row span} of $M$, or equivalently, the averaged second-moment matrix $T = M^{\top} M / n$.
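To make the sampling regime concrete, the following is a minimal NumPy sketch (illustration only; the variable names are ours, not the authors' code) that draws a rank-$r$ matrix and reveals each entry independently with probability $p = C/d$, so that each row carries about $C$ observations:

```python
# Minimal sketch of the ultra-sparse sampling model (illustration only;
# variable names are our own, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n, d, r, C = 10_000, 100, 5, 3           # n >> d, and C below the rank r
U = rng.normal(size=(n, r)) / np.sqrt(r) # row factors
V = rng.normal(size=(d, r))              # column factors
M = U @ V.T                              # unknown rank-r matrix

p = C / d                                # ~C observed entries per row
mask = rng.random((n, d)) < p            # independent Bernoulli(p) mask
X = np.where(mask, M, 0.0)               # observed entries, zeros elsewhere
```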
The empirical second-moment matrix computed from the observed data exhibits sparse, non-random missingness. We propose an unbiased estimator that normalizes each nonzero entry of the empirical second moment by its observed frequency, followed by gradient descent to impute the missing entries of $T$. This normalization divides a weighted sum of $n$ binomial random variables by their total number of ones, a nonlinear operation. We show that the estimator is unbiased for any value of $p$ and enjoys low variance. When the row vectors of $M$ are drawn uniformly from a rank-$r$ factor model satisfying an incoherence condition, we prove that if $n = \Omega(d r^5 \epsilon^{-2} C^{-2} \log d)$, then any local minimum of the gradient-descent objective is approximately global and recovers $T$ with error at most $\epsilon^2$.
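A hedged sketch of our reading of this two-step procedure, continuing the variables above (the squared-loss objective and step size are assumptions, not the paper's exact algorithm): step 1 divides each entry of $X^\top X$ by its co-observation count, which is unbiased for the corresponding entry of $T$ wherever that count is nonzero; step 2 fits a rank-$r$ factorization to those entries by gradient descent and reads off the imputed $T$.

```python
# Step 1: normalize each co-observed entry of X^T X by its count
# (hedged sketch of the estimator described above, not official code).
F = mask.astype(float)
N = F.T @ F                      # N[j, k] = #rows where both j and k observed
S = X.T @ X                      # weighted sums over co-observed rows
obs = N > 0
T_hat = np.zeros((d, d))
T_hat[obs] = S[obs] / N[obs]     # unbiased for T[j, k] wherever N[j, k] > 0

# Step 2: impute the remaining entries by gradient descent on a rank-r
# factorization W W^T fitted to the observed entries of T_hat
# (squared loss and step size are our assumptions).
W = 0.01 * rng.normal(size=(d, r))
lr = 0.01
for _ in range(1000):
    R = (W @ W.T - T_hat) * obs  # residual restricted to observed pairs
    W -= lr * 2.0 * R @ W        # gradient of 0.5 * ||R||_F^2 w.r.t. W
T_full = W @ W.T                 # estimate of T, missing entries imputed
```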
Experiments on both synthetic and real-world data validate our approach. On three MovieLens datasets, our algorithm reduces bias by $88\%$ relative to several baseline estimators. On synthetic data, we also empirically verify that the required number of rows $n$ scales linearly with $d$. Finally, on the Amazon reviews dataset, whose sparsity is $10^{-7}$, our method reduces the recovery error of $T$ by $59\%$ and of $M$ by $38\%$ compared to existing matrix completion methods.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
- Revised the paper to address the reviewers' comments, including remarks on the rank-$r$ factor models and potential extensions.
- Improved the exposition in the abstract and introduction, and added several related works discussing alternative algorithms besides gradient descent for the imputation step.
- Fixed typos and other issues suggested by the reviewers.
Assigned Action Editor: ~Ruoyu_Sun1
Submission Number: 5113