Curvature Enhanced Data Augmentation for Regression

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, License: CC BY 4.0
Abstract: Deep learning models with a large number of parameters, often referred to as over-parameterized models, have achieved exceptional performance across various tasks. Despite concerns about overfitting, these models frequently generalize well to unseen data, thanks to effective regularization techniques, with data augmentation being among the most widely used. While data augmentation has shown great success in classification tasks using label-preserving transformations, its application in regression problems has received less attention. Recently, a novel manifold learning approach for generating synthetic data was proposed, utilizing a first-order approximation of the data manifold. Building on this foundation, we present a theoretical framework and practical tools for approximating and sampling general data manifolds. Furthermore, we introduce the Curvature-Enhanced Manifold Sampling (CEMS) method for regression tasks. CEMS leverages a second-order representation of the data manifold to enable efficient sampling and reconstruction of new data points. Extensive evaluations across multiple datasets and comparisons with state-of-the-art methods demonstrate that CEMS delivers superior performance in both in-distribution and out-of-distribution scenarios, while introducing only minimal computational overhead. Code is available at https://github.com/azencot-group/CEMS.
Lay Summary: Artificial intelligence systems have revolutionized fields from vision to language, but they usually require huge datasets to avoid memorizing instead of learning. One common approach, data augmentation, generates extra synthetic examples to show the AI system more varied scenarios. While this works well for classification tasks (like distinguishing between cats and dogs), it is harder when the system needs to predict continuous values, such as temperatures or stock prices. In our work, we introduce a new method to produce realistic synthetic data for these continuous prediction problems by carefully analyzing how the original data naturally varies. By capturing these natural patterns, we create additional examples that accurately reflect real-world situations. We show that this makes predictions more reliable and accurate across multiple fields, such as environmental monitoring and financial forecasting, without significantly increasing computing resources.
Link To Code: https://github.com/azencot-group/CEMS
Primary Area: Deep Learning->Everything Else
Keywords: Manifold learning, Data augmentation, Regression
Submission Number: 15429
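Illustrative sketch: The abstract describes sampling from a second-order representation of the data manifold. The snippet below is one minimal way to picture that idea in NumPy, and it is not the authors' implementation (see the linked repository for that). It assumes a local tangent basis estimated by SVD over the k nearest neighbours and a least-squares quadratic fit of the normal coordinates. The function name `second_order_sample` and the parameters `k`, `d`, and `sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np

def second_order_sample(Z, idx, k=20, d=2, sigma=0.1, rng=None):
    """Draw one synthetic point near Z[idx] from a local quadratic model.

    Z is an (n, D) array of joint [input, target] points; d is the assumed
    intrinsic dimension. All names here are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    z0 = Z[idx]

    # k nearest neighbours of z0 (skipping z0 itself), centred at z0.
    dist = np.linalg.norm(Z - z0, axis=1)
    nbrs = Z[np.argsort(dist)[1:k + 1]] - z0

    # First d right singular vectors span the estimated tangent space,
    # the remaining ones span the estimated normal space.
    _, _, Vt = np.linalg.svd(nbrs, full_matrices=True)
    T, N = Vt[:d].T, Vt[d:].T
    u = nbrs @ T          # tangent coordinates, shape (k, d)
    g = nbrs @ N          # normal coordinates, shape (k, D - d)

    # Linear plus quadratic monomials of the tangent coordinates.
    iu, ju = np.triu_indices(d)

    def features(coords):
        return np.hstack([coords, coords[:, iu] * coords[:, ju]])

    # Least-squares fit of the normal coordinates (second-order model).
    coef, *_ = np.linalg.lstsq(features(u), g, rcond=None)

    # Sample a new tangent coordinate and lift it back to ambient space.
    u_new = rng.normal(scale=sigma * u.std(axis=0), size=(1, d))
    g_new = features(u_new) @ coef
    return z0 + (u_new @ T.T + g_new @ N.T)[0]
```

Dropping the quadratic monomials from `features` reduces this to a first-order (tangent-plane) sampler, which is the kind of approximation the abstract says earlier work relied on.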