ComposeAnything: Composite Object Priors for Text-to-Image Diffusion Models

ICLR 2026 Conference Submission 17304 Authors

19 Sept 2025 (modified: 08 Oct 2025), CC BY 4.0
Keywords: Diffusion models, text-to-image generation, image layout with LLM planning, structured noise initialization, prior-guided diffusion
TL;DR: We propose 2.5D composite object priors for structured noise initialization, combined with prior-guided diffusion and spatially controlled denoising, for out-of-distribution, highly surreal compositional image generation.
Abstract: Generating images from text with complex object arrangements remains a major challenge for current text-to-image (T2I) models. Existing training-based solutions, such as layout-conditioned models or reinforcement learning methods, improve compositional accuracy but often distort realism, leading to floating objects, broken physics, and degraded image quality. In this work, we introduce ComposeAnything, an inference-only framework that enhances compositional generation without retraining. Our key idea is to replace stochastic noise initialization with \emph{composite object priors}: interpretable, structured composites of objects, created from 2.5D layouts produced by large language models together with pretrained image generators. We further propose prior-guided diffusion, which integrates these priors into the denoising process to enforce compositional correctness while preserving visual fidelity. This training-free strategy enables seamless generation of compositional objects and coherent backgrounds, while allowing refinement of inaccurate priors. ComposeAnything consistently outperforms state-of-the-art inference-only methods on the T2I-CompBench and NSR-1K benchmarks, especially for prompts with complex spatial relations, high object counts, and surreal scenes. Human evaluations confirm that our method generates images that are not only compositionally faithful but also visually coherent.
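The submission page does not include the algorithm itself, so the following is only a rough sketch of the two mechanisms the abstract names: structured noise initialization from a composite prior, and prior-guided, spatially controlled denoising. It assumes a standard DDPM-style forward process and a latent-diffusion backbone; the function names (init_structured_noise, prior_guided_x0) and the schedule parameters (guide_until, strength) are illustrative assumptions, not the authors' implementation.

```python
import math
import torch

def init_structured_noise(prior_latent: torch.Tensor, alpha_bar_T: float) -> torch.Tensor:
    """Forward-diffuse the composite object prior to the terminal timestep,
    replacing the usual pure-Gaussian initialization x_T ~ N(0, I)."""
    eps = torch.randn_like(prior_latent)
    return math.sqrt(alpha_bar_T) * prior_latent + math.sqrt(1.0 - alpha_bar_T) * eps

def prior_guided_x0(x0_pred: torch.Tensor, prior_latent: torch.Tensor,
                    obj_mask: torch.Tensor, t_frac: float,
                    guide_until: float = 0.4, strength: float = 0.5) -> torch.Tensor:
    """During an early fraction of the denoising schedule, pull the model's
    clean-image estimate toward the prior inside object regions; afterwards
    the model denoises freely, which lets it refine inaccurate priors."""
    if t_frac < guide_until:
        w = strength * obj_mask  # obj_mask in [0, 1], broadcastable to the latent
        return (1.0 - w) * x0_pred + w * prior_latent
    return x0_pred

# Example with dummy shapes (SD-style 4x64x64 latent):
z_prior = torch.randn(1, 4, 64, 64)            # latent of the rendered 2.5D composite
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0                  # region covered by composed objects
x_T = init_structured_noise(z_prior, alpha_bar_T=0.0047)  # typical alpha_bar at T

x0_hat = torch.randn(1, 4, 64, 64)             # stand-in for the model's x0 prediction
x0_guided = prior_guided_x0(x0_hat, z_prior, mask, t_frac=0.2)
```

Restricting the guidance to early steps and to masked object regions is one plausible reading of "spatially controlled denoising": early steps fix global layout, while later unguided steps restore texture and background coherence.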
Primary Area: generative models
Submission Number: 17304