Keywords: Computer graphics, Image generation, Controllable generation, Customization
TL;DR: We tackle multi-image layout control by letting a diffusion model learn where to place elements based on reference images rather than text
Abstract: Text-to-image models have reached a level of realism that enables highly convincing image generation. However, text-based control becomes a limiting factor when more explicit guidance is needed: finer control requires specifying both the content and its precise placement within the image. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires only a single image per reference, and provides explicit and simple control over object- and part-level composition. We demonstrate its effectiveness across various image composition tasks.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 6238