Abstract: We explore the single-spiked covariance model within the context of sparse principal component analysis (PCA), which aims to recover a sparse unit vector from noisy samples. From an information-theoretic perspective, $O(k \log p)$ observations are sufficient to recover a $k$-sparse $p$-dimensional vector $\mathbf{v}$. However, existing polynomial-time methods require at least $\Omega(k^2)$ samples for successful recovery, highlighting a significant gap in sample efficiency. To bridge this gap, we introduce a novel thresholding-based algorithm that requires only $\Omega(k \log p)$ samples, provided the signal strength satisfies $\lambda = \Omega(\|\mathbf{v}\|_\infty^{-1})$. We also propose a two-stage nonconvex algorithm that further enhances estimation performance. This approach integrates our thresholding algorithm with truncated power iteration, achieving the minimax-optimal rate of statistical error under the desired sample complexity. Numerical experiments validate the superior performance of our algorithms in terms of estimation accuracy and computational efficiency.
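The following is a minimal illustrative sketch, not the paper's actual algorithm: it mimics the two-stage pattern the abstract describes (a thresholding step to localize the support, followed by truncated power iteration) under the single-spiked model $\Sigma = I + \lambda \mathbf{v}\mathbf{v}^\top$. The function names, the coordinate-scoring rule, and the threshold choice `tau` are all assumptions made for the example.

```python
# Hedged sketch of a thresholding + truncated-power-iteration pipeline for
# single-spiked sparse PCA. The specific thresholding rule and constants are
# illustrative assumptions, not the procedure proposed in the paper.
import numpy as np

def truncated_power_iteration(S, x0, k, n_iter=50):
    """Power iteration on S, keeping only the k largest-magnitude entries."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        y = S @ x
        support = np.argsort(np.abs(y))[-k:]      # retain top-k coordinates
        x_new = np.zeros_like(y)
        x_new[support] = y[support]
        x = x_new / np.linalg.norm(x_new)
    return x

def two_stage_sparse_pca(X, k, tau=None):
    """Stage 1: threshold the sample covariance to pick a candidate support
    and build an initializer. Stage 2: refine via truncated power iteration."""
    n, p = X.shape
    S = X.T @ X / n
    if tau is None:
        tau = 2.0 * np.sqrt(np.log(p) / n)        # assumed noise-level threshold
    # Stage 1: score each coordinate by its thresholded off-diagonal mass.
    off_diag = np.abs(S - np.diag(np.diag(S)))
    scores = np.sum(off_diag * (off_diag > tau), axis=1)
    support = np.argsort(scores)[-k:]
    # Initialize with the leading eigenvector of the k-by-k principal submatrix.
    x0 = np.zeros(p)
    x0[support] = np.linalg.eigh(S[np.ix_(support, support)])[1][:, -1]
    # Stage 2: refine on the full sample covariance.
    return truncated_power_iteration(S, x0, k)

# Small synthetic check under the single-spiked covariance model.
rng = np.random.default_rng(0)
p, k, n, lam = 500, 10, 300, 4.0
v = np.zeros(p); v[:k] = 1.0 / np.sqrt(k)
cov = np.eye(p) + lam * np.outer(v, v)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
v_hat = two_stage_sparse_pca(X, k)
print("|<v_hat, v>| =", abs(v_hat @ v))         # close to 1 indicates recovery
```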
Lay Summary: We explore how many samples are required to recover an unknown sparse signal from noisy data in a fundamental high-dimensional statistics problem called sparse principal component analysis (PCA). The number of samples required by existing polynomial-time algorithms is greater than the theoretical minimum.
We design two algorithms to bridge this gap. The first, a thresholding-based method, achieves the theoretical sample limit when the signal is strong enough relative to its largest component. The second combines the first with iterative refinement, enhancing estimation performance.
Our work partly resolves a basic challenge in sparse PCA. Experiments confirm that our methods outperform existing approaches in both speed and estimation accuracy.
Primary Area: Probabilistic Methods
Keywords: Sparse PCA, principal component analysis, single-spiked covariance model, sample complexity
Submission Number: 11764