Abstract: The performance of deep learning models is intrinsically tied to the quality, volume, and relevance of their training data. Gathering ample data for production scenarios often demands significant time and resources. Among various strategies, data augmentation circumvents exhaustive data collection by generating new data points from existing ones. However, traditional augmentation techniques can be less effective amidst a shift in training and testing distributions. This paper explores the potential of synthetic data by leveraging large pre-trained models for data augmentation, especially when confronted with distribution shifts. Although recent advancements in generative models have enabled several prior works in cross-distribution data generation, they require model fine-tuning and a complex setup. To bypass these shortcomings, we introduce Domain Gap Embeddings (DoGE), a plug-and-play semantic data augmentation framework in a cross-distribution few-shot setting. Our method extracts disparities between source and desired data distributions in a latent form, and subsequently steers a generative process to supplement the training set with endless diverse synthetic samples. Our evaluations, conducted on a subpopulation shift and three domain adaptation scenarios under a few-shot paradigm, reveal that our versatile method improves performance across tasks without needing hands-on intervention or intricate fine-tuning. DoGE paves the way to effortlessly generate realistic, controllable synthetic datasets following the test distributions, bolstering real-world efficacy for downstream task models.
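To make the core idea concrete, the following is a minimal sketch of what "extracting the domain gap in a latent form and steering generation with it" could look like. It assumes CLIP image embeddings, a simple mean-difference gap vector, and an additive shift of source embeddings; these choices, along with all file paths, are illustrative assumptions and not the paper's exact recipe.

```python
# Hedged sketch: estimate a "domain gap embedding" from a few target-domain
# samples and use it to shift source embeddings before conditioning an
# embedding-conditioned generator. CLIP and the additive shift are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(images):
    """Return L2-normalized CLIP image embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def domain_gap_embedding(source_images, target_images):
    """Latent disparity between the source and the (few-shot) target distribution."""
    return embed(target_images).mean(dim=0) - embed(source_images).mean(dim=0)

# Usage: shift each source embedding toward the target distribution; the shifted
# embeddings could then condition an embedding-conditioned image generator
# (e.g., an unCLIP-style pipeline) to synthesize augmented training samples.
source = [Image.open(p) for p in ["src_0.png", "src_1.png"]]   # hypothetical paths
target = [Image.open(p) for p in ["tgt_0.png", "tgt_1.png"]]   # few target-domain shots
gap = domain_gap_embedding(source, target)
shifted = embed(source) + gap  # embeddings to steer the generative process
```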