One-Sided Matrix Completion from Ultra-Sparse Samples

TMLR Paper 5113 Authors

15 Jun 2025 (modified: 21 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Matrix completion is a classical problem that has received recurring interest from a wide range of fields. In this paper, we revisit matrix completion in an ultra-sparse sampling setting, where each entry of an unknown $n \times d$ matrix $M$ is observed with probability $p = \frac{C}{d}$ for a constant $C \ge 2$ (assuming $n \ge d$). This setting is motivated by large-scale panel datasets that exhibit high sparsity in practice. While the total number of observed samples, roughly $Cn$ in expectation, is insufficient to recover $M$, we show that it is possible to recover one side of $M$, namely the second moment of its row vectors, $T = \frac{1}{n} M^{\top} M$. The empirical second moment computed from the observed data involves non-random missingness and high sparsity. We design an algorithm that estimates $T$ by normalizing every nonzero entry of the empirical second moment by its observed frequency, followed by gradient descent to impute the remaining missing entries. Each normalized entry divides a weighted sum of $n$ binomial random variables by the total number of ones, which is challenging to analyze due to nonlinearity and sparsity. We provide estimation and recovery guarantees for this estimator in the ultra-sparse regime, showing that it is unbiased for any $p$ and incurs low variance. Assuming the row vectors of $M$ are sampled from a rank-$r$ factor model satisfying a standard incoherence condition, we prove that when $n \ge O(\frac{d r^5 \log d}{C^2 \epsilon^2})$, our algorithm recovers $T$ with Frobenius norm error less than $\epsilon^2$. We also extend one-sided matrix completion as a sub-procedure toward imputing the missing entries of $M$ itself. Experiments on both synthetic and real-world data evaluate this approach. On three MovieLens datasets, our approach reduces bias by $88\%$ relative to its alternatives. We also validate, on synthetic data, that the required $n$ scales linearly with $d$. On an Amazon reviews dataset with sparsity $10^{-7}$, our approach reduces the recovery error of $T$ by $59\%$ and of $M$ by $38\%$ compared to existing matrix completion methods.
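To make the two-step procedure described in the abstract concrete, below is a minimal NumPy sketch: step 1 normalizes each entry of the empirical second moment by its co-observation frequency, and step 2 runs gradient descent on a rank-$r$ factorization to impute the entries with no co-observations. This is an illustration under stated assumptions, not the authors' implementation; the function names, learning rate, and iteration count are hypothetical.

```python
import numpy as np

def estimate_T(M_obs, mask):
    """Step 1: plug-in estimate of T = (1/n) M^T M from partial observations.

    M_obs : (n, d) array with unobserved entries set to 0.
    mask  : (n, d) boolean array, True where an entry was observed.

    Entry (j, k) of M_obs.T @ M_obs sums M_ij * M_ik only over rows i where
    BOTH columns j and k were observed, so we divide by that co-observation
    count rather than by n.
    """
    F = mask.astype(float)
    S = (M_obs.T @ M_obs).astype(float)   # sums over co-observed rows
    counts = F.T @ F                      # co-observation count per (j, k) pair
    T_hat = np.divide(S, counts, out=np.zeros_like(S), where=counts > 0)
    return T_hat, counts > 0              # estimate and its observed support

def complete_T(T_hat, support, r, lr=0.05, steps=2000, seed=0):
    """Step 2: gradient descent on a rank-r factorization T ~ U U^T,
    fitting only the entries of T_hat whose co-observation count is nonzero."""
    d = T_hat.shape[0]
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((d, r))
    for _ in range(steps):
        R = (U @ U.T - T_hat) * support   # residual restricted to observed pairs
        U -= lr * (2.0 / d) * (R @ U)     # descent along the (scaled) gradient
                                          # of the masked squared loss
    return U @ U.T
```

A typical call would be `T_hat, support = estimate_T(M_obs, mask)` followed by `T_full = complete_T(T_hat, support, r)`. Dividing by the co-observation count rather than by $n p^2$ keeps the estimator unbiased for any $p$ without requiring $p$ to be known.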
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Ruoyu_Sun1
Submission Number: 5113