Keywords: Text-to-image diffusion models
TL;DR: Training text-to-image diffusion models only on ImageNet
Abstract: Recent text-to-image (T2I) generation models have achieved remarkable results by training on billion-scale datasets, following a `bigger is better' paradigm that prioritizes data quantity over availability (closed vs. open source) and reproducibility (data decay vs. established collections). We challenge this paradigm by demonstrating that one can match or outperform models trained on massive web-scraped collections using only ImageNet, enhanced with well-designed text and image augmentations. With this much simpler setup, we achieve a $+1\%$ overall score over SD-XL on GenEval and $+0.5\%$ on DPGBench, while using just 1/10th the parameters and 1/1000th the training images.
This opens the way for more reproducible research, as ImageNet is a widely available dataset and our standardized training setup does not require massive compute resources.
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 369