TL;DR: We propose a novel online data training framework that, for the first time, unifies dynamic data pruning and augmentation to enhance both training efficiency and model generalization.
Abstract: Dynamic data selection aims to accelerate training without sacrificing performance.
However, reducing training data inherently limits data diversity, potentially hindering generalization.
While data augmentation is widely used to enhance diversity, it is typically not optimized in conjunction with selection.
As a result, directly combining these techniques fails to fully exploit their synergies.
To tackle this challenge, we propose a novel online data training framework that, for the first time, unifies dynamic data selection and augmentation, achieving both training efficiency and enhanced performance.
Our method estimates each sample's joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples while suppressing the inclusion of noisy or ambiguous data.
This enables a more significant reduction in dataset size without sacrificing model generalization.
Experimental results demonstrate that our method outperforms existing state-of-the-art approaches across various benchmark datasets and architectures, e.g., reducing training costs by 50% on ImageNet-1k without performance loss.
Furthermore, our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.
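The abstract describes scoring each sample by its local density and its multimodal semantic consistency, then selecting rare-but-consistent samples for augmented training. Below is a minimal, hypothetical NumPy sketch of that scoring idea, assuming precomputed image and label-text embeddings; the function names, the kNN-based density proxy, the consistency floor, and the score combination are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def local_sparsity(features: np.ndarray, k: int = 10) -> np.ndarray:
    """Mean distance to the k nearest neighbours; high values = low local density (rare)."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # ignore self-distance
    knn = np.sort(np.sqrt(d2), axis=1)[:, :k]     # k nearest-neighbour distances
    return knn.mean(axis=1)

def semantic_consistency(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between image embeddings and their label-text embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return (img * txt).sum(axis=1)

def select_for_augmentation(img_emb, txt_emb, keep_ratio=0.5, k=10,
                            consistency_floor=0.2):
    """Keep rare-but-consistent samples; drop redundant or likely-noisy ones."""
    sparsity = local_sparsity(img_emb, k=k)
    consistency = semantic_consistency(img_emb, txt_emb)

    # Normalise both signals to [0, 1] so they can be combined multiplicatively.
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    score = norm(sparsity) * norm(consistency)

    # Suppress samples whose image-text agreement is too low (noisy/ambiguous),
    # regardless of how rare they are.
    score[consistency < consistency_floor] = -np.inf

    n_keep = int(keep_ratio * len(score))
    keep_idx = np.argsort(-score)[:n_keep]        # indices of samples to train on
    return keep_idx                               # augmentation is then applied to these

# Usage with random stand-in embeddings (real ones would come from a
# vision-language encoder; this only demonstrates the call pattern).
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(256, 64))
txt_emb = img_emb + 0.1 * rng.normal(size=(256, 64))
selected = select_for_augmentation(img_emb, txt_emb, keep_ratio=0.5)
print(f"kept {len(selected)} of 256 samples")
```

A per-epoch selection of this kind would keep the retained subset diverse (via the sparsity term) while filtering mislabeled or ambiguous samples (via the consistency term), which is the synergy between selection and augmentation that the abstract emphasizes.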
Lay Summary: Training powerful AI models usually takes massive amounts of data and time. But not all data are equally helpful — some are noisy, repetitive, or hard to learn from. While data selection techniques aim to speed up training by removing less useful samples, this can hurt performance by reducing data diversity. On the other hand, data augmentation improves diversity, but is rarely coordinated with selection.
In this work, we propose a new approach that combines both strategies in a smart, unified way. Our method evaluates how rare and semantically consistent each data point is — for example, whether the image content matches its label — and selects only those that are both informative and suitable for transformation. This ensures the model trains on diverse, high-quality data without being distracted by noise.
Our method significantly reduces training time, especially on large-scale datasets, while maintaining or even improving model performance. This paves the way for more efficient, robust, and accessible AI training in real-world applications.
Primary Area: Deep Learning->Everything Else
Keywords: Dynamic data selection, dataset pruning, data augmentation
Submission Number: 7003