Score-based Pullback Riemannian Geometry: Extracting the Data Manifold Geometry using Anisotropic Flows
TL;DR: A scalable framework for extracting the geometry of data manifolds
Abstract: Data-driven Riemannian geometry has emerged as a powerful tool for interpretable representation learning, offering improved efficiency in downstream tasks. Moving forward, it is crucial to balance cheap manifold mappings with efficient training algorithms. In this work, we integrate concepts from pullback Riemannian geometry and generative models to propose a framework for data-driven Riemannian geometry that is scalable in both geometry and learning: score-based pullback Riemannian geometry. Focusing on unimodal distributions as a first step, we propose a score-based Riemannian structure with closed-form geodesics that pass through the data probability density. With this structure, we construct a Riemannian autoencoder (RAE) with error bounds for discovering the correct data manifold dimension. This framework can naturally be used with anisotropic normalizing flows by adopting isometry regularization during training. Through numerical experiments on diverse datasets, including image data, we demonstrate that the proposed framework produces high-quality geodesics passing through the data support, reliably estimates the intrinsic dimension of the data manifold, and provides a global chart of the manifold. To the best of our knowledge, this is the first scalable framework for extracting the complete geometry of the data manifold.
Lay Summary: Most real-world data—from images to scientific measurements—exists in high-dimensional spaces that can be difficult to work with computationally. However, this data often has hidden structure: it actually lies on a lower-dimensional "manifold" (a curved surface) embedded within the larger space. Think of how a photograph of a face, while containing millions of pixel values, is really constrained by the underlying geometry of human facial features.
This research introduces a new computational framework that can automatically discover both the true dimensionality of data and its geometric structure. The key innovation lies in learning what we call a "pullback Riemannian metric"—essentially a mathematical tool that captures how distances, paths, and angles work on the data's curved surface.
The method builds on Normalizing Flows, a type of machine learning model that can transform complex data distributions into simpler ones. We enhance these flows with two crucial components: regularization that preserves local geometric relationships (ensuring nearby points stay nearby), and a learnable base distribution that automatically identifies which dimensions matter most.
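As an illustrative sketch (not the paper's implementation), the "nearby points stay nearby" idea behind isometry regularization can be expressed as a penalty on the Jacobian of the flow: a map that locally preserves distances has J&#8559;J equal to the identity. The finite-difference helper and the example map below are assumptions for demonstration only.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-5):
    """Finite-difference Jacobian of f at x (illustrative helper, not the paper's code)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

def isometry_penalty(f, x):
    """Penalize deviation of J^T J from the identity, so that f locally
    preserves distances -- a sketch of isometry regularization."""
    J = jacobian_fd(f, x)
    return np.sum((J.T @ J - np.eye(x.size)) ** 2)

# A rotation is an isometry, so its penalty is (numerically) zero.
rotation = lambda x: np.array([0.6 * x[0] - 0.8 * x[1],
                               0.8 * x[0] + 0.6 * x[1]])
print(isometry_penalty(rotation, np.array([0.3, -0.7])))  # ~0
```

In training, a penalty of this form would be averaged over data points and added to the flow's likelihood objective.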
This design creates a natural compression mechanism with a unique capability. During training, the model learns to assign larger importance to dimensions that capture meaningful features and near-zero importance to dimensions containing noise or redundant information. As a result, it simultaneously discovers how many dimensions are actually necessary to represent the data and builds an encoder-decoder system that operates precisely at that intrinsic dimension—something no previous approach has achieved, to our knowledge.
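The dimension-selection mechanism described above can be mimicked in a few lines: given the per-dimension variances learned by the base distribution, keep only the coordinates whose variance is a non-negligible fraction of the total. This is a toy sketch under assumed names and a made-up threshold, not the paper's estimator.

```python
import numpy as np

def estimate_intrinsic_dim(variances, tol=1e-2):
    """Count latent coordinates whose learned variance exceeds a fraction
    `tol` of the total variance; the discarded coordinates contribute only
    a small residual (illustrative sketch, not the paper's code)."""
    v = np.sort(np.asarray(variances, dtype=float))[::-1]
    return int(np.sum(v / v.sum() > tol))

# Toy example: 3 informative dimensions, 7 near-noise dimensions.
learned_vars = [4.0, 2.5, 1.0] + [1e-4] * 7
print(estimate_intrinsic_dim(learned_vars))  # → 3
```

The retained coordinates then serve as the latent code of the Riemannian autoencoder, so the compression rate is chosen by the data rather than fixed in advance.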
What sets this framework apart is that it goes beyond compression. The learned geometric structure provides closed-form mathematical expressions for computing optimal paths (geodesics), distances, and transformations on the data manifold, giving researchers powerful tools for analysis and interpretation that were previously unavailable.
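To make the closed-form geodesics concrete, here is a minimal sketch of the pullback idea with a hand-picked two-dimensional diffeomorphism (the map `phi` is an assumption for illustration, not a trained flow): geodesics in data space are straight lines in the flattened space, mapped back through the inverse.

```python
import numpy as np

# Toy diffeomorphism "flattening" a parabola-shaped data manifold x2 = x1**2.
def phi(x):        # forward map to the flat (latent) space
    x1, x2 = x
    return np.array([x1, x2 - x1**2])

def phi_inv(z):    # exact inverse
    z1, z2 = z
    return np.array([z1, z2 + z1**2])

def geodesic(x_a, x_b, t):
    """Closed-form geodesic under the pullback metric: interpolate linearly
    in the flat space, then map back through phi_inv."""
    z = (1 - t) * phi(x_a) + t * phi(x_b)
    return phi_inv(z)

# Two points on the parabola; the geodesic midpoint stays on it,
# whereas the straight-line midpoint (0, 1) would leave the manifold.
x_a, x_b = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
print(geodesic(x_a, x_b, 0.5))  # → [0. 0.]
```

With a learned flow in place of `phi`, the same one-line interpolation yields geodesics that follow the data support, which is what makes the geometry cheap to use downstream.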
We tested our approach on datasets ranging from high-dimensional Euclidean datasets to synthetic image manifolds and handwritten digits (MNIST). On datasets that align with the method's assumptions, it accurately estimated the intrinsic dimension and achieved near-perfect compression with minimal information loss. Even on MNIST, which deviates from our modeling assumptions because its ten digit classes make the distribution multimodal, the method still identified compact latent representations, albeit with a modest overestimation of the intrinsic dimension.
This work establishes a practical foundation for learning the geometry of high-dimensional data at scale, unlocking computational possibilities that were previously beyond reach. The method provides efficient ways to calculate optimal routes and measure true distances within complex datasets, advancing how artificial intelligence systems understand similarity and structure. It also produces compression systems with clear, interpretable representations of data's essential features. Perhaps most significantly, this framework supports optimization methods that operate directly on the data's curved surface, opening new approaches to any problem that requires optimizing while staying within realistic data boundaries. This capability could transform fields ranging from inverse problems like medical image reconstruction to controllable content generation.
Link To Code: https://github.com/GBATZOLIS/Score-Based-Pullback-Riemannian-Geometry
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: data manifold geometry, score-based pullback Riemannian metric, anisotropic normalizing flows, closed-form geodesics, intrinsic dimension estimation, Riemannian auto-encoder, interpretable representation learning
Submission Number: 11209