Keywords: Diffusion Models, Fair Composition, Game-Theoretic, Text-to-Image
TL;DR: a game-theoretic approach to compositional sampling from multiple pre-trained diffusion models
Abstract: With the widespread availability of pre-trained diffusion models, there are many options for which models to use and how to use them together. These decisions depend heavily on both the user's goals and each model's expertise. Taking this into account, we propose coordinating models as one would a specialized workforce--through a fair yet efficient division of labor. Divide-and-Denoise uses multiple pre-trained diffusion models, each defined over the same space, to refine a noisy sample over time. At every timestep, we alternate between (i) dividing the sample into regions in a way that satisfies our game-theoretic criteria and (ii) denoising each region with its assigned model in a way that respects our alignment criteria. This yields a new composite denoising process that evolves together with a division process. Since ground truth is typically not available for our setup, we measure how well Divide-and-Denoise coordinates a team of single-concept text-to-image diffusion models relative to a multi-concept model. On the GenEval benchmark, our method generates images that capture the strengths of each model, outperforming baselines and resolving common failures like missing objects and mismatched attributes.
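The alternating loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`divide_and_denoise`, `assign_regions`) and the representation of a division as one boolean mask per model are assumptions made here for clarity, and the game-theoretic division and alignment criteria are abstracted into the `assign_regions` callback.

```python
import numpy as np

def divide_and_denoise(models, assign_regions, x_T, timesteps):
    """Hypothetical sketch of the alternating divide/denoise loop.

    models: list of single-concept denoisers, each mapping (x, t) -> denoised x
    assign_regions: callback that divides the sample into one boolean
        mask per model (standing in for the game-theoretic division step)
    x_T: the initial noisy sample
    """
    x = x_T
    for t in timesteps:
        # (i) divide the current sample into regions, one per model
        masks = assign_regions(x, t, len(models))
        # (ii) each model denoises only its assigned region
        x_next = np.empty_like(x)
        for model, mask in zip(models, masks):
            x_next[mask] = model(x, t)[mask]
        x = x_next
    return x
```

In this sketch the division is recomputed at every timestep, so the composite denoising process and the division process evolve together, as the abstract describes.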
Primary Area: generative models
Submission Number: 10873