Keywords: Text-to-Image, Compositional, Curriculum Learning
TL;DR: We introduce CompGen, a curriculum learning framework that expands scaling boundaries for compositional text-to-image generation using scene graphs and adaptive difficulty-aware sampling.
Abstract: Text-to-Image (T2I) generation has long been an open problem, and compositional synthesis remains particularly challenging. The task requires accurately rendering complex scenes that contain multiple objects with diverse attributes and intricate spatial and semantic relationships, which demands both precise object placement and coherent inter-object interactions. In this paper, we propose CompGen, a compositional curriculum learning framework that addresses compositional weaknesses in T2I models. Specifically, we leverage scene graphs and introduce a novel difficulty criterion together with a corresponding adaptive Markov Chain Monte Carlo (MCMC) graph sampling algorithm. Using this difficulty-aware approach, we generate training datasets for Group Relative Policy Optimization (GRPO) that comprise prompts and question-answer pairs of varying complexity. We show that different training schedulers yield distinct scaling curves for GRPO: data distributions that follow an easy-to-hard progression or a Gaussian sampling strategy scale better than random sampling. Extensive experiments demonstrate that CompGen significantly strengthens compositional generation for both diffusion and auto-regressive T2I models, highlighting its effectiveness in improving the compositional understanding of T2I generation systems.
Primary Area: generative models
Submission Number: 18923
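Illustrative sketch (not the authors' implementation): the abstract contrasts easy-to-hard, Gaussian, and random training schedulers but does not specify them. The Python below shows one plausible reading, assuming integer difficulty levels sorted in ascending order and a Gaussian whose center moves from the easiest to the hardest level as training progresses. The function names, the `sigma` parameter, and the linear progress rule are all hypothetical.

```python
import math
import random


def gaussian_weights(levels, center, sigma=1.0):
    """Normalized Gaussian weights over difficulty levels, peaked at `center`."""
    raw = [math.exp(-((lv - center) ** 2) / (2 * sigma ** 2)) for lv in levels]
    total = sum(raw)
    return [w / total for w in raw]


def sample_difficulty(step, total_steps, levels, schedule="gaussian",
                      sigma=1.0, rng=random):
    """Pick a difficulty level for one training step.

    Assumes `levels` is sorted ascending (easiest first). All three
    schedules are guesses at what the abstract describes.
    """
    # Training progress in [0, 1]: 0 at the first step, 1 at the last.
    progress = step / max(total_steps - 1, 1)
    if schedule == "random":
        # Baseline: uniform over all difficulty levels.
        return rng.choice(levels)
    if schedule == "easy_to_hard":
        # Deterministic staircase from the easiest to the hardest level.
        idx = min(int(progress * len(levels)), len(levels) - 1)
        return levels[idx]
    if schedule == "gaussian":
        # Gaussian centered on a difficulty that grows linearly with progress.
        center = levels[0] + progress * (levels[-1] - levels[0])
        weights = gaussian_weights(levels, center, sigma)
        return rng.choices(levels, weights=weights, k=1)[0]
    raise ValueError(f"unknown schedule: {schedule}")
```

Under these assumptions, the Gaussian schedule still draws some easier and harder prompts at every step, while the easy-to-hard schedule exposes exactly one level at a time.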