Keywords: data manifold, manifold learning, generalization bounds, controlled datasets, deep learning theory
Abstract: A significant gap exists between theory and practice in deep learning. While generalization and approximation error bounds have been proposed, they are often restricted to overly simplified models or yield loose guarantees. Many of these bounds rely on the manifold hypothesis and depend on geometric regularity properties such as the intrinsic dimension, curvature, or reach of the data manifold or target functions. However, evaluations of such bounds typically fall into one of two extremes: they use either synthetic, analytically defined manifolds whose geometric properties are precisely known, or real-world datasets on which the bounds are judged solely by downstream performance. Neither approach adequately reveals how data geometry affects the tightness or applicability of the theoretical results.
We propose a general-purpose framework for studying data geometry by creating dense, controllable versions of dSprites and COIL-20 with additional transformation dimensions and fine sampling resolution. This setup enables accurate finite-difference estimates of geometric measures such as curvature, reach, and volume, offering a flexible benchmark for analyzing manifold learning methods. As illustrative applications, we evaluate two established manifold learning bounds by Genovese et al. and Fefferman et al., and we examine how manifold geometry evolves across network layers in $\beta$-VAEs. Our results highlight both the limitations of existing bounds and the potential of such controlled benchmarks to guide future theoretical developments.
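The abstract's mention of finite-difference estimates of geometric quantities on densely sampled manifolds can be illustrated with a minimal sketch. The snippet below (an assumption for illustration, not the paper's actual procedure) estimates pointwise curvature of a densely sampled 1-D curve via central finite differences; for a circle of radius 1, the curvature is 1 everywhere and the reach equals 1/curvature.

```python
import numpy as np

def curvature_fd(points, dt):
    """Estimate pointwise curvature of a densely sampled curve via
    finite differences.

    points: (n, d) array of samples along the curve, parameter spacing dt.
    Returns an (n,) array of curvature estimates (boundary values use
    one-sided differences and are less accurate).
    """
    v = np.gradient(points, dt, axis=0)   # first derivative (velocity)
    a = np.gradient(v, dt, axis=0)        # second derivative (acceleration)
    speed = np.linalg.norm(v, axis=1)
    v_hat = v / speed[:, None]
    # Component of acceleration orthogonal to the tangent direction;
    # curvature = |a_perp| / |v|^2 for a regular parametrized curve.
    a_perp = a - np.sum(a * v_hat, axis=1)[:, None] * v_hat
    return np.linalg.norm(a_perp, axis=1) / speed**2

# Sanity check on a unit circle, where curvature is exactly 1.
t = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
pts = np.stack([np.cos(t), np.sin(t)], axis=1)
kappa = curvature_fd(pts, t[1] - t[0])
```

With ~2000 samples the central-difference error is O(dt^2), so interior curvature estimates agree with the true value of 1 to high precision; this is the kind of accuracy that fine sampling resolution makes possible on the proposed dense benchmarks.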
Primary Area: learning theory
Submission Number: 14642