TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization
Abstract: We present TALE, a novel training-free framework that harnesses the power of text-driven diffusion models to tackle the cross-domain image composition task, which aims to seamlessly incorporate user-provided objects into a specific visual context regardless of domain disparity. Previous methods often involve either training auxiliary networks or fine-tuning diffusion models on customized datasets, which is expensive and may undermine the robust textual and visual priors of pretrained diffusion models. Some recent works attempt to break this barrier by proposing training-free workarounds that rely on manipulating attention maps to tame the denoising process implicitly. However, composing via attention maps does not necessarily yield the desired compositional outcomes. These approaches retain only partial semantic information and typically either fall short of preserving the identity characteristics of input objects or exhibit limited background-object style adaptation in the generated images. In contrast, TALE operates directly in latent space, providing explicit and effective guidance for the composition process to resolve these problems. Specifically, we equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former formulates noisy latents conducive to initiating and steering the composition process by directly leveraging the background and foreground latents at corresponding timesteps. The latter complements the former by exploiting designated energy functions to further optimize intermediate latents toward specific conditions, producing the desired final results. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition across various photorealistic and artistic domains.
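The two mechanisms named in the abstract can be pictured with a short sketch. The following is a minimal illustration only, assuming a generic latent diffusion pipeline; all names here (manipulate_latent, optimize_latent, blend_mask, energy_fn, eta) are hypothetical placeholders and do not reflect the authors' actual implementation or hyperparameters.

```python
# Illustrative sketch, not the paper's code. Assumes noisy background and
# foreground latents are available at matching timesteps (e.g., via DDIM
# inversion) and that energy_fn is a differentiable scalar objective.
import torch

@torch.no_grad()
def manipulate_latent(bg_latent_t, fg_latent_t, blend_mask):
    """Adaptive Latent Manipulation (sketch): splice the foreground
    object's noisy latent into the background latent at timestep t.
    blend_mask is 1 inside the user-specified object region, 0 elsewhere."""
    return blend_mask * fg_latent_t + (1.0 - blend_mask) * bg_latent_t

def optimize_latent(latent_t, energy_fn, eta=0.1, steps=3):
    """Energy-guided Latent Optimization (sketch): nudge the intermediate
    latent down the gradient of a designated energy function, e.g., one
    scoring identity preservation or background-object style consistency."""
    for _ in range(steps):
        latent_t = latent_t.detach().requires_grad_(True)
        energy = energy_fn(latent_t)                # scalar to minimize
        grad, = torch.autograd.grad(energy, latent_t)
        latent_t = (latent_t - eta * grad).detach() # gradient descent step
    return latent_t
```

In this reading, the blended latent initializes and steers each denoising step explicitly, while the energy-gradient updates refine it toward the stated conditions; the step size eta and number of steps are illustrative choices, not values from the paper.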
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language, [Experience] Art and Culture, [Experience] Multimedia Applications
Relevance To Conference: The novel TALE method presented in this work is highly relevant to the theme "Multimedia in the Generative AI Era" and the interests of the conference, which focuses on advancing the state of the art in image synthesis and manipulation. TALE addresses a critical challenge at the intersection of text-driven diffusion models and image-guided composition: the difficulty of explicitly steering the denoising process with user-specified text prompts and images to generate the desired composited results without training. This is a prominent issue, as prior methods often compromise the rich visual and textual priors of pretrained models and fail to maintain semantic and identity characteristics during composition. TALE's relevance is further underscored by its innovative approach of composing in latent space. Unlike existing methods that rely on attention maps and often yield suboptimal results, TALE provides an explicit framework for image guidance that allows more precise control over the composition process, thereby excelling at preserving identity characteristics and facilitating style adaptation between the background and the object.
Supplementary Material: zip
Submission Number: 2425