BRADD: Balancing Representations with Anomaly Detection and Diffusion

Published: 06 May 2025 · Last Modified: 06 May 2025 · Venue: SynData4CV · License: CC BY 4.0
Keywords: imbalanced data, self-supervised training
Abstract: Self-supervised learning (SSL) has enabled advances in language processing and computer vision by exploiting the large quantities of unlabelled data available. However, imbalances in training datasets can induce strong biases in the learned features of pre-trained models, and prior results show that pre-training on imbalanced data can also hurt downstream performance. We propose a data-centric approach: during training, our method identifies underrepresented samples and uses diffusion to generate novel data that complements them. Our proposed method, BRADD (Balancing Representations with Anomaly Detection and Diffusion), uses distance-based outlier detection to identify regions of the embedding space that are underrepresented in each training cycle. Experimental results on ImageNet-100-LT demonstrate that BRADD consistently outperforms both balanced and imbalanced baselines, with significant improvements on fine-grained classification tasks. Detailed ablation studies confirm that both out-of-distribution sample selection and diffusion-based generation contribute substantially to the effectiveness of our approach, offering a promising alternative to model-centric solutions for addressing imbalance in self-supervised learning.
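The abstract gives only a high-level description of the two components. As a rough illustration of the distance-based selection step, samples in sparse regions of the embedding space can be flagged by their mean distance to their k nearest neighbours. The sketch below is not the authors' implementation; the function name, `k`, and `top_fraction` are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Score each embedding by its mean distance to its k nearest neighbours.

    Higher scores mark samples in sparsely populated regions of the
    embedding space, i.e. candidates for being underrepresented.
    """
    # k + 1 neighbours because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    distances, _ = nn.kneighbors(embeddings)
    return distances[:, 1:].mean(axis=1)  # drop the self-distance column

# Toy usage: a dense cluster plus a few isolated points.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(500, 128))
sparse = rng.normal(3.0, 0.1, size=(5, 128))
emb = np.vstack([dense, sparse])

scores = knn_outlier_scores(emb, k=5)
top_fraction = 0.01  # fraction flagged per cycle (hypothetical knob)
n_flag = max(1, int(top_fraction * len(emb)))
flagged = np.argsort(scores)[-n_flag:]
print("flagged indices:", flagged)  # the isolated points rank highest
```

For the generation step, the page does not state which diffusion model or conditioning strategy BRADD uses. Purely as a stand-in, an off-the-shelf image-to-image pipeline from Hugging Face diffusers could produce variants of a flagged image; the model ID, `strength`, and file names below are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("flagged_sample.png").convert("RGB").resize((512, 512))
# Moderate strength keeps generations close to the underrepresented image
# while still adding novel variation.
variants = pipe(prompt="", image=init, strength=0.6, num_images_per_prompt=4).images
for i, img in enumerate(variants):
    img.save(f"synthetic_{i}.png")
```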
Submission Number: 72
