Contrastive Visual Data Augmentation

Yu Zhou; Bingxuan Li; Tang Mohan; Xiaomeng Jin; Te-Lin Wu; Kuan-Hao Huang; Heng Ji; Kai-Wei Chang; Nanyun Peng

Published: 01 May 2025, Last Modified: 23 Jul 2025, ICML 2025 poster, CC BY 4.0

TL;DR: We introduce the Contrastive Visual Data Augmentation strategy and the NovelSpecies dataset.

Abstract: Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, and thereby improve their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Extracted features and augmented images are automatically filtered to ensure their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets, including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show that CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNaturalist).
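The three stages named in the abstract (contrastive feature extraction, targeted generation, automatic filtering) can be illustrated with a toy sketch. This is not the authors' implementation: the names `contrastive_features`, `build_prompt`, `filter_candidates`, and `Candidate` are hypothetical, features are modeled as plain strings, a set difference stands in for the extraction step, and the image generator and feature scorer are left as placeholders that a real system would back with multimodal models.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Candidate:
    """A generated image, represented here only by a placeholder ID (assumption)."""
    prompt: str
    image_id: str


def contrastive_features(target: Set[str], confused: Set[str]) -> List[str]:
    """Keep features of the target concept that the look-alike concept lacks.

    A set difference is a stand-in for the paper's extraction step.
    """
    return sorted(target - confused)


def build_prompt(concept: str, features: List[str]) -> str:
    """Compose a text-to-image prompt that emphasizes the contrastive features."""
    return f"a photo of a {concept} with " + ", ".join(features)


def filter_candidates(
    candidates: List[Candidate],
    features: List[str],
    score: Callable[[Candidate, str], float],
    threshold: float = 0.5,
) -> List[Candidate]:
    """Keep candidates whose score meets the threshold for every feature.

    `score` is a hypothetical callable; in practice it might be a
    vision-language model judging whether the feature is visible.
    """
    return [
        c for c in candidates
        if all(score(c, f) >= threshold for f in features)
    ]
```

In this sketch a candidate image is discarded as soon as any contrastive feature scores below the threshold, which mirrors the idea that the synthetic data should show exactly the cues separating the target concept from the concept it is confused with. The threshold value of 0.5 is arbitrary.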

Lay Summary: AI models that understand images still mix up rare or brand-new objects because their training images lack the subtle cues humans notice. Our method automatically finds the key traits that distinguish a new concept from the look-alike it’s confused with, then generates a few synthetic images that highlight those cues. After filtering for quality, a single synthetic image can be used to teach these models, improving recognition accuracy by 12% on unseen animal species and by 5–6% on two standard benchmarks. CoDA offers a fast, low-cost way to keep AI systems current.

Link To Code: https://contrastive-visual-data-augmentation.github.io/

Primary Area: Applications->Computer Vision

Keywords: Data Augmentation, Text-to-Image Generation, Feature Extraction

Flagged For Ethics Review: true

Submission Number: 672
