Abstract: In this paper, we study sparse $\{0,\pm1\}$-matrix based random projection, which has been widely applied in classification tasks to reduce data dimensionality. For such sparse matrices, it is computationally interesting to explore the minimum number of nonzero entries $\pm1$ required to achieve the best or nearly best classification performance. For this purpose, we analyze the impact of matrix sparsity on the $\ell_1$ distance between projected data points. Our analysis is grounded in the fundamental principle of Principal Component Analysis (PCA), which posits that larger variances between projected data points better capture the variation inherent in the original data, thereby improving classification performance. Theoretically, the $\ell_1$ distance between projected data points is related not only to the sparsity of the projection matrix, but also to the distribution of the original data. Without loss of generality, we consider two typical data distributions, the Gaussian mixture distribution and the two-point distribution, which are commonly employed in modeling real-world data. Using these two distributions, we estimate the $\ell_1$ distance between projected data points. It is found that sparse matrices with only one to a few dozen nonzero entries per row can provide comparable or even larger $\ell_1$ distances than denser matrices, provided the matrix size $m\geq\mathcal{O}(\sqrt{n})$. Accordingly, a similar performance trend should be observed in classification. This is confirmed by classification experiments on real data of different types, including image, text, gene, and binary-quantized data.
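To make the setting concrete, a minimal sketch of the sparse $\{0,\pm1\}$ random projection studied here might look as follows; the two-component Gaussian-mixture data, the matrix sizes, and the $1/\sqrt{k}$ scaling are illustrative assumptions for demonstration, not the exact configuration analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_pm1_matrix(m, n, k, rng):
    """Sparse {0, +1, -1} projection matrix with exactly k nonzero entries per row."""
    R = np.zeros((m, n))
    for i in range(m):
        cols = rng.choice(n, size=k, replace=False)   # positions of the nonzeros
        R[i, cols] = rng.choice([-1.0, 1.0], size=k)  # random signs
    return R

# Illustrative two-component Gaussian-mixture data (assumption, for demonstration only).
n, num_points = 1024, 200
labels = rng.integers(0, 2, size=num_points)
X = rng.normal(loc=labels[:, None] * 2.0, scale=1.0, size=(num_points, n))

m = int(4 * np.sqrt(n))  # matrix size on the order of sqrt(n)
for k in (1, 8, 32, 128, n):
    R = sparse_pm1_matrix(m, n, k, rng) / np.sqrt(k)  # scale so values are comparable across k
    Y = X @ R.T                                       # projected data, shape (num_points, m)
    # mean l1 distance between projected points from the two mixture components
    d = np.abs(Y[labels == 0][:, None, :] - Y[labels == 1][None, :, :]).sum(-1).mean()
    print(f"k = {k:4d}: mean inter-class l1 distance = {d:.1f}")
```

Running such a sketch for several sparsity levels gives a quick empirical feel for how the inter-class $\ell_1$ distance depends on the number of nonzero entries per row.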
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=Z5j4ydrDRy
Changes Since Last Submission: Dear Dr. Sprechmann,
We would like to express our sincere gratitude to you and the previous AE and reviewers for taking the time to review our manuscript.
**This is a resubmission (encouraged by the AE) of the previously rejected manuscript TMLR 2288. For the convenience of review, if possible, please consider assigning the manuscript to the previous anonymous AE and reviewers.**
**The major contribution of the work has been acknowledged by the previous AE and reviewers**, as stated in the concluding remarks of the AE: "All the reviewers have given yes to both the claims and evidence questions. Similarly, all the reviewers believe that the work itself is interesting."
**The primary deficiency of the previous manuscript**, as commented by the AE, is that our major claim was derived from the numerical analysis (P3-P6) of the theoretical results of Theorems 1 and 2, rather than from *direct proofs*. The AE therefore kindly suggested that we "tone down such sentences or have more rigorous arguments to support those claims".
Following the suggestions, in the revised manuscript **we have addressed the problem by providing more rigorous arguments (Eqs. (7) and (13) in Theorems 1 and 2) to support our major claim:** sparse matrices with only a few nonzero entries per row can provide classification performance comparable to that of denser matrices.
The new additions to the manuscript (highlighted in red) mainly include Eqs. (7), (8), (13), and (14) and the discussions of them, which are briefly introduced as follows.
1) In Theorems 1 and 2, we have further analyzed the convergence error of the expected $\ell_1$ distance $\mathbf{E}|r^\top x|$ with finite matrix sparsity $k$, as detailed in Eqs. (7) and (13).
2) In the Remarks of Theorems 1 and 2, using Eqs. (7) and (13), we have further derived the lower bound of $k$ that ensures the convergence error ratio is upper-bounded by any given positive constant $\eta$, as shown in Eqs. (8) and (14). For $k$ above the lower bound derived for a small $\eta$, $\mathbf{E}|r^\top x|$ should take similar values for different $k$, and accordingly, different $k$ should also yield similar classification performance. We thus reach *the conclusion* that sparse matrices with small $k$ (taking values around the lower bound) will provide classification performance comparable to that of denser matrices with larger $k$.
3) The smaller the lower bound of $k$ (derived under small error ratios), the sparser the matrix structure we can obtain. In Fig. 2, we have therefore examined the value of the lower bound of $k$ that ensures small error ratios, by computing the convergence error ratios for different $k$. It is found that small error ratios (close to zero) can be approached with small $k$, such as $k\geq 20$, corresponding to sparse matrix structures; a minimal numerical sketch of this trend is included below.
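For intuition, the behavior of $\mathbf{E}|r^\top x|$ as a function of $k$ can also be checked with a small Monte Carlo simulation; the sketch below uses an illustrative two-point distribution, dimension, and $\sqrt{k}$ normalization that are assumptions for demonstration and do not reproduce the exact quantities in Eqs. (7), (8), (13), and (14).

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_abs_projection(k, n, num_trials, sample_x, rng):
    """Monte Carlo estimate of E|r^T x| for a row r with k nonzero +-1 entries."""
    total = 0.0
    for _ in range(num_trials):
        x = sample_x(n, rng)
        cols = rng.choice(n, size=k, replace=False)   # positions of the k nonzeros in r
        signs = rng.choice([-1.0, 1.0], size=k)       # their random signs
        total += abs(np.dot(signs, x[cols]))
    return total / num_trials

def two_point(n, rng, a=1.0):
    """Illustrative two-point data: each coordinate is +a or -a with equal probability."""
    return a * rng.choice([-1.0, 1.0], size=n)

n = 1024
for k in (1, 5, 20, 100, 500, n):
    est = mean_abs_projection(k, n, num_trials=2000, sample_x=two_point, rng=rng)
    # after dividing by sqrt(k), the estimate flattens out quickly as k grows
    # (for this data it approaches sqrt(2/pi) ~ 0.80 already at moderate k)
    print(f"k = {k:4d}: E|r^T x| / sqrt(k) ~= {est / np.sqrt(k):.3f}")
```

In this toy setting the normalized value stabilizes after only a few dozen nonzero entries per row, mirroring the trend reported in Fig. 2.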
Assigned Action Editor: ~Bamdev_Mishra1
Submission Number: 3167