The Underlying Universal Statistical Structure of Natural Datasets

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We analyze natural and synthetic datasets using RMT tools, finding that several universal properties are tied to the strength of correlations in the feature distribution, and we connect these properties to the number of samples required to reach ergodicity.
Abstract: We study universal properties of real-world complex and synthetically generated datasets. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. Examining the local and global eigenvalue statistics of feature-feature covariance matrices, we find: (i) the power-law scaling of bulk eigenvalues vastly differs between uncorrelated Gaussian and real-world data, (ii) this power-law behavior is reproducible using Gaussian data with long-range correlations, (iii) all dataset types exhibit chaotic RMT universality, (iv) RMT statistics emerge at smaller dataset sizes than typical training sets, correlating with power-law convergence, (v) Shannon entropy correlates with the RMT structure and requires fewer samples to converge in strongly correlated datasets. These results suggest that natural image Gram matrices can be approximated by Wishart random matrices with a simple covariance structure, enabling rigorous analysis of neural network behavior.
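The abstract's central comparison, bulk eigenvalue spectra of feature-feature covariance matrices for uncorrelated versus long-range-correlated Gaussian data, can be sketched in a few lines of NumPy. The sketch below is illustrative and not the authors' code: the matrix sizes, the exponent `alpha`, and the choice of a rotated power-law population covariance as a stand-in for "long-range correlations" are all assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): compare bulk eigenvalue
# scaling of the empirical feature-feature covariance for (a) uncorrelated
# Gaussian data vs. (b) Gaussian data whose population covariance has
# power-law eigenvalues lambda_k ~ k^{-alpha}.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 4096, 512  # illustrative sizes, not the paper's

def bulk_spectrum(X):
    """Descending eigenvalues of the empirical feature-feature covariance."""
    C = (X.T @ X) / X.shape[0]            # (n_features, n_features)
    return np.sort(np.linalg.eigvalsh(C))[::-1]

def fitted_exponent(lam, lo=10, hi=None):
    """Least-squares slope of log(lambda_k) vs log(k) over the bulk."""
    hi = hi or len(lam) // 2              # skip edge eigenvalues
    k = np.arange(1, len(lam) + 1)
    slope, _ = np.polyfit(np.log(k[lo:hi]), np.log(lam[lo:hi]), 1)
    return -slope

# (a) Uncorrelated Gaussian features (Wishart / Marchenko-Pastur regime).
X_iid = rng.standard_normal((n_samples, n_features))

# (b) Correlated Gaussian features: power-law variances rotated into a
#     random orthogonal basis, so no single feature axis is privileged.
alpha = 1.0                               # assumed correlation strength
scales = np.arange(1, n_features + 1) ** (-alpha / 2)
Q, _ = np.linalg.qr(rng.standard_normal((n_features, n_features)))
X_corr = rng.standard_normal((n_samples, n_features)) * scales @ Q.T

for name, X in [("iid Gaussian", X_iid), ("correlated", X_corr)]:
    beta = fitted_exponent(bulk_spectrum(X))
    print(f"{name:>13}: bulk eigenvalues scale roughly as k^(-{beta:.2f})")
```

Under this construction the correlated case recovers a bulk exponent close to `alpha`, while the i.i.d. case does not follow a power law, which mirrors findings (i) and (ii) of the abstract.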
Lay Summary: Modern AI models learn by discovering common patterns in large amounts of data, such as images. We've discovered that despite their complexity, these "natural datasets" share some surprising and universal statistical "fingerprints." By treating data as a physical system, we used tools from physics (particularly Random Matrix Theory) to study the relationships between features in the data (imagine the correlations between pixels in an image). We found that the way these relationships are structured follows predictable mathematical rules, specifically a "power-law" pattern in how the data's core components (eigenvalues) are distributed. This pattern is consistent across different types of datasets, from real-world images to specially created ones. Importantly, these datasets behave like "chaotic" systems in physics, meaning their statistical properties can be described by well-understood universal theories. This discovery suggests we can create simpler, more understandable models of complex data, which could help us better understand how artificial intelligence learns and how to improve it.
Primary Area: Theory->Everything Else
Keywords: Random Matrix Theory, Data Structure, Universality, Gaussian Data, Empirical Data Estimation, Power Law Scaling
Submission Number: 4106