Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation

Published: 25 Sept 2024, Last Modified: 06 Nov 2024NeurIPS 2024 posterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Fine-grained Visual Classification, Data Augmentation, Synthetic Data, Diffusion Models, Image Classification
TL;DR: Our paper introduces SaSPA, a novel data augmentation (DA) method for fine-grained visual classification that surpasses existing DA methods by generating synthetic data that is both more diverse and accurately represents fine-grained classes.
Abstract: Fine-grained visual classification (FGVC) involves classifying closely related subcategories. This task is inherently difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models have introduced new possibilities for data augmentation in image classification. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on text-to-image generation or Img2Img methods, such as SDEdit, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and generative data augmentation techniques. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training and contextual bias. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data.
Supplementary Material: zip
Primary Area: Machine vision
Submission Number: 7340
Loading