Predict Training Data Quality via Its Geometry in Metric Space

Yang Ba; Mohammad Sadeq Abolhasani; Rong Pan

Published: 23 Sept 2025, Last Modified: 21 Oct 2025; NPGML Poster; CC BY 4.0

Keywords: persistent homology, data quality, diversity measure, training data selection

Abstract: High-quality training data is the foundation of machine learning and artificial intelligence, shaping how models learn and perform. Although much is known about what types of data are effective for training, the impact of the data's geometric structure on model performance remains largely underexplored. We propose that both the richness of representation and the elimination of redundancy within training data critically influence learning outcomes. To investigate this, we employ persistent homology to extract topological features from data within a metric space, thereby offering a principled way to quantify diversity beyond entropy-based measures. Our findings highlight persistent homology as a powerful tool for analyzing and enhancing the training data that drives AI systems.

Submission Number: 90
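
The abstract does not specify how topological features are turned into a diversity score, so the following is only a minimal sketch of the general idea, not the authors' method. It computes 0-dimensional persistent homology of a Vietoris-Rips filtration over points in Euclidean space, using the standard fact that the finite death times of connected components equal the edge lengths of a minimum spanning tree (under the convention that an edge enters the filtration at its full length). The function names `h0_death_times` and `total_h0_persistence`, the choice of Euclidean distance, and the use of total persistence as a diversity proxy are all assumptions made for illustration.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Finite death times of 0-dimensional classes in a Vietoris-Rips filtration.

    Implemented as Kruskal's algorithm with union-find: each edge that merges
    two components kills one class at that edge's length.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the component root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (assumed metric).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)
    return deaths  # n - 1 finite values; one component never dies


def total_h0_persistence(points):
    """Hypothetical diversity proxy: sum of finite H0 death times."""
    return sum(h0_death_times(points))


# Three exact duplicates plus one distinct point: highly redundant.
redundant = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
# Four distinct points on the corners of a unit square.
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(total_h0_persistence(redundant))
print(total_h0_persistence(spread))
```

In this toy comparison the duplicated points contribute zero-length bars, so the redundant set scores lower than the spread-out set of the same size. Higher-dimensional features (loops, voids) would require a full persistent homology library and are not covered by this sketch.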
