Generative Data Mining with Longtail-Guided Diffusion

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, License: CC BY 4.0
TL;DR: Proactive longtail discovery and mitigation with model-guided synthetic training data generation
Abstract: It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.
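Illustrative sketch (not from the paper): the abstract describes using a differentiable, model-based longtail signal as guidance while the predictive model stays frozen. The toy below shows only that general pattern, under loud assumptions: the "predictive model" is a hypothetical fixed linear softmax classifier `W`, the latent `z` is a plain 2-D vector with no diffusion model or decoder involved, and the signal is predictive entropy, which is a stand-in for total uncertainty and is not the paper's epistemic uncertainty formulation. All names (`W`, `z0`, `entropy_grad`, `guide`, `scale`, `steps`) are invented for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def logits_of(W, z):
    # Hypothetical frozen linear classifier: logits = W z.
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def entropy_grad(W, z):
    # Analytic gradient of predictive entropy with respect to the latent z.
    p = softmax(logits_of(W, z))
    h = entropy(p)
    # dH/dlogit_j = -p_j * (log p_j + H); zero when p_j underflows to 0.
    g_logits = [-pj * (math.log(pj) + h) if pj > 0.0 else 0.0 for pj in p]
    # Chain rule through the linear layer: dH/dz = W^T g_logits.
    return [sum(W[j][d] * g_logits[j] for j in range(len(W)))
            for d in range(len(z))]

def guide(W, z, scale=0.1, steps=50):
    # Nudge the latent toward higher uncertainty. W is never updated,
    # so the classifier's parameters and predictions are unaffected.
    for _ in range(steps):
        g = entropy_grad(W, z)
        z = [zi + scale * gi for zi, gi in zip(z, g)]
    return z

# Assumed toy setup: 3 classes, 2-D latent, a confidently classified start.
W = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
z0 = [1.5, 0.2]
z1 = guide(W, z0)
h0 = entropy(softmax(logits_of(W, z0)))
h1 = entropy(softmax(logits_of(W, z1)))
print(f"entropy before guidance: {h0:.3f}, after: {h1:.3f}")
```

In a real latent diffusion setting the gradient would be added to the sampler's denoising update at each step rather than applied as plain gradient ascent; how LTG computes its signal and avoids exposing the predictive model to intermediate diffusion states is specified in the paper, not here.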
Lay Summary: We imbue predictive AI models with the ability to continuously dream up additional hard or rare data that can be used as additional training data for improving their own capabilities in uncommon real-world scenarios. We further provide techniques to reduce those hard or rare data to textual descriptions so that humans can better anticipate what an AI model might struggle with before release.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Synthetic Data, Longtail, Long Tail, Foundation Model, Diffusion, Guidance, VLM, CLIP, Embedding, Text, Robustness, Uncertainty, Epistemic, Aleatoric, Autolabel, Imagine, Imagination, Dream
Submission Number: 4028