AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Object Detection Training

16 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Data Pruning, Data-centric AI, Data Curation, Object Detection, Data Selection
TL;DR: We propose Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-based feedback in a cluster-adaptive manner.
Abstract: Training contemporary machine learning models on large-scale datasets incurs significant computational costs and often contends with data redundancy. Data pruning aims to alleviate this by selecting smaller, informative subsets. However, existing methods face challenges: density-based approaches can be task-agnostic, while model-based ones may retain redundancy or be computationally expensive. To address these limitations, we propose Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-based feedback in a cluster-adaptive manner. AdaDeDup first partitions samples in an embedding space, then uses a proxy model to estimate the impact of initial density-based pruning within each cluster by comparing losses on kept versus pruned samples. This signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive pruning in redundant clusters while preserving data in informative ones. We conduct extensive experiments on large-scale object detection benchmarks, including the Waymo, COCO, and nuScenes datasets, using standard models such as BEVFormer and Faster R-CNN. AdaDeDup significantly outperforms prominent baselines, from embedding-similarity-based deduplication to state-of-the-art semantic deduplication, across various pruning ratios. Notably, AdaDeDup substantially reduces performance degradation compared to baselines (e.g., by over 54% vs. random sampling on Waymo), and it achieves near-original model performance while pruning 20% of the data, highlighting its effectiveness in improving data efficiency for large-scale model training. Code is open-sourced.
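The cluster-adaptive loop described in the abstract (partition, density-prune, proxy-loss feedback, threshold adjustment) can be sketched as follows. This is a minimal illustrative sketch, not the authors' released implementation: the function name `adadedup_select`, the centroid-distance density proxy, the `tanh`-based threshold adjustment, and the parameters `base_keep` and `alpha` are all assumptions for exposition.

```python
import numpy as np
from sklearn.cluster import KMeans

def adadedup_select(embeddings, proxy_losses, base_keep=0.8,
                    n_clusters=8, alpha=0.5, seed=0):
    """Hypothetical sketch of cluster-adaptive pruning.

    embeddings:   (N, D) array of sample embeddings
    proxy_losses: (N,) per-sample losses from a proxy model
    base_keep:    initial density-based keep ratio per cluster
    alpha:        strength of the model-feedback adjustment
    Returns a boolean mask over the N samples (True = keep).
    """
    # Step 1: partition samples in the embedding space.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    keep_mask = np.zeros(len(embeddings), dtype=bool)

    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        if len(idx) == 0:
            continue
        # Step 2: density-based pruning proxy -- rank samples by distance
        # from the cluster centroid and keep the most distinct ones first.
        center = embeddings[idx].mean(axis=0)
        dist = np.linalg.norm(embeddings[idx] - center, axis=1)
        order = idx[np.argsort(-dist)]
        k = max(1, int(base_keep * len(idx)))
        kept, pruned = order[:k], order[k:]

        if len(pruned) > 0:
            # Step 3: proxy-model feedback -- if pruned samples carry much
            # higher loss than kept ones, pruning was too aggressive here.
            gap = proxy_losses[pruned].mean() - proxy_losses[kept].mean()
            # Step 4: adapt this cluster's keep ratio accordingly.
            adj_keep = np.clip(base_keep + alpha * np.tanh(gap), 0.1, 1.0)
            k = max(1, int(adj_keep * len(idx)))

        keep_mask[order[:k]] = True
    return keep_mask
```

In this sketch, redundant clusters (where pruned samples add little loss) end up with a lower effective keep ratio, while informative clusters retain more of their data, mirroring the adaptive behavior the abstract describes.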
Supplementary Material: pdf
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 8101