Principal component analysis for very heavy-tailed data

ICLR 2026 Conference Submission 22521 Authors

Published: 20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Robust PCA; Heavy-tailed data; Robust statistics; Random matrix theory; Dimensionality reduction; Scalable algorithms; Unsupervised learning
TL;DR: We introduce a simple, scalable PCA algorithm tailored for extremely heavy-tailed data (including infinite-variance cases), which outperforms existing robust PCA methods on synthetic, transcriptomic, and connectomic benchmarks.
Abstract: Principal component analysis (PCA) is a ubiquitous tool for dimensionality reduction and exploratory data analysis. However, most theoretical and empirical studies implicitly assume that noise is light-tailed. When data are corrupted by heavy-tailed noise, as is increasingly common (e.g., in omics or brain-connectivity data), standard PCA techniques can fail dramatically. While recent work in robust statistics has addressed this problem in certain contexts, many existing methods remain sensitive to extreme outliers and perform poorly under truly heavy-tailed distributions. Furthermore, many methods that have been designed for heavy-tailed distributions do not scale well to large datasets. In this work, we propose a novel PCA algorithm that is designed for extremely heavy-tailed noise and remains computable even for very large data matrices. Our approach reduces sensitivity to extreme observations while recovering informative low-rank structure. For very heavy-tailed data with a large number of observations, we demonstrate significant improvements over classical PCA and existing robust PCA variants.
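The abstract does not spell out the proposed algorithm, but the failure mode it describes is easy to reproduce. The sketch below is not the authors' method: it plants a rank-1 signal in Cauchy (infinite-variance) noise and compares classical PCA against spatial-sign ("spherical") PCA, a standard robust baseline that stands in for the class of robust variants the paper benchmarks against. The matrix sizes and signal strength are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 100                       # observations x features (arbitrary choices)

# Planted rank-1 signal: random scores along a random unit direction u.
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
scores = rng.standard_normal(n)
X = 20.0 * np.outer(scores, u)         # signal strength is an arbitrary assumption

# Heavy-tailed corruption: i.i.d. Cauchy noise has infinite variance.
X = X + rng.standard_cauchy(size=(n, p))

def leading_direction(A):
    """Top right singular vector of the column-centered data matrix."""
    A = A - A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[0]

# Classical PCA: a handful of extreme entries dominate the fit.
v_pca = leading_direction(X)

# Spatial-sign PCA: project each observation onto the unit sphere before
# PCA, which caps the influence of any single extreme observation.
X_sign = X / np.linalg.norm(X, axis=1, keepdims=True)
v_sign = leading_direction(X_sign)

print(f"classical PCA    |<v, u>| = {abs(v_pca @ u):.3f}")   # typically near 0
print(f"spatial-sign PCA |<v, u>| = {abs(v_sign @ u):.3f}")  # typically much larger
```

The alignment |<v, u>| measures how well each estimate recovers the planted direction; under Cauchy noise, classical PCA's leading component tends to chase the single most extreme entries, while the sign-normalized variant degrades far more gracefully. The paper's claim is that its algorithm improves on robust baselines of this kind while also scaling to large data matrices.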
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 22521