TL;DR: We analyze the impact of limited data on the training of energy-based models, focusing on the dynamics of the eigenmodes, and give a theoretical perspective on early stopping and data-correction protocols that improve the quality of the inferred model.
Abstract: We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Using the Gaussian model as a testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We show that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite-data corrections can be accurately modeled through asymptotic random matrix theory calculations, and we provide the counterpart of generalized cross-validation in the energy-based-model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations.
These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the reliability of energy-based generative models trained on limited data.
Finally, we propose a generalization to arbitrary energy-based models by deriving the neural tangent kernel dynamics of the score function under the score-matching algorithm.
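To make the eigenmode picture concrete, here is a minimal illustrative sketch (not the paper's code, and with purely hypothetical dimensions, sample size, and step size): for a zero-mean Gaussian EBM p(x) ∝ exp(-xᵀJx/2), the per-sample log-likelihood is L(J) = ½(log det J − tr(JC)) with C the empirical covariance, so gradient ascent on J reduces, in the eigenbasis of C, to independent per-mode updates with mode-dependent timescales.

```python
import numpy as np

# Illustrative sketch, assuming a zero-mean Gaussian EBM p(x) ∝ exp(-x^T J x / 2).
# Per-sample log-likelihood: L(J) = (log det J - tr(J C)) / 2, so dL/dJ = (J^{-1} - C) / 2.
# If J is initialized proportional to the identity, it stays co-diagonal with C and
# each eigenmode j_k evolves independently toward 1 / c_k; modes with small empirical
# eigenvalue c_k (typically the noisiest under limited data) are the slowest to learn,
# which is the timescale mismatch behind early stopping.

rng = np.random.default_rng(0)
d, n = 50, 200                                  # dimension and sample size (illustrative)
J_true = np.diag(np.linspace(0.5, 2.0, d))      # ground-truth precision (coupling) matrix
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(J_true), size=n)
C = X.T @ X / n                                 # empirical covariance
c = np.linalg.eigvalsh(C)                       # its eigenvalues, one per mode

eta, steps = 0.05, 2000
j = np.ones(d)                                  # J_0 = I: every eigenmode starts at 1
trajectory = []
for t in range(steps):
    j = j + 0.5 * eta * (1.0 / j - c)           # independent per-mode gradient ascent
    trajectory.append(j.copy())

# Each mode relaxes toward 1 / c_k; small-c_k (noise-dominated) modes converge slowest
# and are the ones most distorted by finite n.
```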
Lay Summary: This study investigates how machine learning models can be trained effectively when the amount of data is limited, focusing on energy-based generative models. These models are often used to uncover hidden structures in complex datasets, such as gene interactions or brain connectivity. However, with limited data, they tend to overfit — capturing noise rather than meaningful patterns.
To understand this, the authors analyze a simplified model that allows for an exact mathematical treatment. They show that different features in the data are learned at different rates during training, and that the less relevant features, which are often dominated by noise, take longer to learn. This mismatch of timescales leads to overfitting and to a degradation of the model if training runs too long.
Based on this insight, the study identifies optimal stopping points to prevent overfitting and introduces mathematical corrections that improve the reliability of the model without the need for additional data. The framework also generalizes to more complex models that are not exactly solvable.
Overall, this work contributes to the theoretical foundations of machine learning by explaining how data scarcity affects training dynamics and overfitting, with the goal of making models more robust in real-world scenarios where data is often limited.
Primary Area: General Machine Learning->Unsupervised and Semi-supervised Learning
Keywords: energy-based models, overfitting, random-matrix theory, inverse problems, early stopping, training dynamics, Boltzmann machine
Submission Number: 13303