## A geometrical connection between sparse and low-rank matrices and its application to manifold learning

**Abstract:** We consider when a sparse nonnegative matrix $\mathbf{S}$ can be recovered, via an elementwise nonlinearity, from a real-valued matrix $\mathbf{L}$ of significantly lower rank. Of particular interest is the setting where the positive elements of $\mathbf{S}$ encode the similarities of nearby points on a low dimensional manifold. The recovery can then be posed as a problem in manifold learning: in this case, how to learn a norm-preserving and neighborhood-preserving mapping of high dimensional inputs into a lower dimensional space. We describe an algorithm for this problem based on a generalized low-rank decomposition of sparse matrices. This decomposition has the interesting property that it can be encoded by a neural network with one layer of rectified linear units; since the algorithm discovers this encoding, it can also be viewed as a layerwise primitive for deep learning. The algorithm regards the inputs $\mathbf{x}_i$ and $\mathbf{x}_j$ as similar whenever the cosine of the angle between them exceeds some threshold $\tau\in(0,1)$. Given this threshold, the algorithm attempts to discover a mapping $\mathbf{x}_i\mapsto\mathbf{y}_i$ by matching the elements of two sparse matrices; in particular, it seeks a mapping for which $\mathbf{S}=\max(0,\mathbf{L})$, where $S_{ij} = \max(0,\mathbf{x}_i\!\cdot\!\mathbf{x}_j\! -\! \tau\|\mathbf{x}_i\|\|\mathbf{x}_j\|)$ and $L_{ij} = \mathbf{y}_i\!\cdot\!\mathbf{y}_j\! -\! \tau\|\mathbf{y}_i\|\|\mathbf{y}_j\|$. We apply the algorithm to data sets where vector magnitudes and small cosine distances have interpretable meanings (e.g., the brightness of an image, the similarity to other words). On these data sets, the algorithm is able to discover much lower dimensional representations that preserve these meanings.
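The two matrices defined in the abstract can be written out in a few lines of NumPy. The following is only a minimal sketch, assuming the inputs and embeddings are stored as rows of arrays; the function names (`thresholded_gram`, `sparse_similarity`, `mismatch`) are hypothetical and not taken from the paper. It evaluates $\mathbf{S}$ and $\max(0,\mathbf{L})$ for given $\mathbf{x}_i$ and $\mathbf{y}_i$ and measures how far apart they are; it does not implement the optimization that searches for the mapping.

```python
import numpy as np

def thresholded_gram(V, tau):
    """Return M with M[i, j] = v_i . v_j - tau * ||v_i|| * ||v_j||.

    Rows of V are the vectors v_i. Applied to embeddings Y this is the
    (possibly negative) matrix L from the abstract.
    """
    norms = np.linalg.norm(V, axis=1)
    return V @ V.T - tau * np.outer(norms, norms)

def sparse_similarity(X, tau):
    """Return S = max(0, thresholded Gram matrix of the inputs X).

    S[i, j] is positive only when the cosine of the angle between
    x_i and x_j exceeds tau.
    """
    return np.maximum(0.0, thresholded_gram(X, tau))

def mismatch(X, Y, tau):
    """Frobenius norm of S - max(0, L); zero when S = max(0, L) holds exactly."""
    S = sparse_similarity(X, tau)
    L = thresholded_gram(Y, tau)
    return np.linalg.norm(S - np.maximum(0.0, L))
```

For instance, `mismatch(X, Y, tau)` could serve as a simple check of how well a candidate embedding `Y` (with fewer columns than `X`) reproduces the sparse similarities of `X`; the choice of the Frobenius norm here is an assumption made for illustration.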

**License:** Creative Commons Attribution 4.0 International (CC BY 4.0)

**Submission Length:** Regular submission (no more than 12 pages of main content)

**Changes Since Last Submission:** Many changes were made to incorporate suggestions from the anonymous reviewers. In the main manuscript, there is more discussion of non-convexity and convergence, and in a supplementary appendix, there are new results from a nonlinear (versus linear) baseline for dimensionality reduction. A number of other minor changes were also made throughout the paper. The following is a list of specific changes; page numbers refer to the current manuscript.

- page 1: The abstract indicates that the sparse matrix is recovered specifically via an elementwise nonlinearity; it also states earlier that the embedding is norm-preserving. The first sentence of the introduction has been reworded. The word "geometrical" has been added to the title.
- page 2: The second paragraph of the introduction gives another example of rank reduction (when $S$ is the identity matrix).
- page 4: Further explanation of the model has been added to the paragraph after equation 2.
- page 5: The last paragraph of section 2.1 clarifies that many inputs are regarded as similar even when they are not one-nearest neighbors. The caption to Figure 2 clarifies that the blue/orange histograms are separately normalized and have small but nonzero overlap. A footnote has been added on the handling of outliers in graph-based methods.
- page 6: The top paragraph emphasizes that the optimization is non-convex and discusses how one tests empirically for convergence.
- page 8: The typo after eq. (10) has been fixed (describing the least-squares sense in which linear projections from SVD are optimal). The conflicting notation between eqs. (11-12) has been corrected. A footnote directs the reader to Appendix B, which contains results from a modified implementation of the Isomap algorithm. Another footnote mentions the need to safeguard against stereotypes and implicit biases when word-vectors are used for real-world applications.
- page 12: An acknowledgements section has been added.
- page 17: Appendix A.1 contains more discussion on initializing non-convex optimizations for nonlinear dimensionality reduction.
- page 18: Figure 8 (new) plots the convergence of the objective function for the alternating minimization.
- pages 18-21: A new supplementary section (Appendix B) presents results from a modified implementation of Isomap based on cosine distances. This section also includes three new figures to present and interpret these results.

**Assigned Action Editor:** ~Zhihui_Zhu1

**Submission Number:** 418
