LoMOE: Localized Multi-Object Editing via Multi-Diffusion

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | Readers: Everyone | CC BY 4.0
Abstract: Recent developments in diffusion models have demonstrated an exceptional capacity to generate high-quality prompt-conditioned image edits. Nevertheless, previous approaches have primarily relied on textual prompts for image editing, which tend to be less effective for precise edits to specific objects or fine-grained regions within a scene containing one or more objects. To overcome this challenge, we introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process. This framework empowers users to perform various operations on objects within an image, such as adding, replacing, or editing $\textbf{many}$ objects in a complex scene $\textbf{in one pass}$. Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions, resulting in high-fidelity image editing. A combination of cross-attention and background preservation losses within the latent space ensures that the characteristics of the object being edited are preserved while simultaneously achieving a high-quality, seamless reconstruction of the background with fewer artifacts than the state-of-the-art (SOTA). We also curate and release a dataset dedicated to multi-object editing, named $\texttt{LoMOE}$-Bench. Our experiments against existing SOTA methods demonstrate the improved effectiveness of our approach in terms of both image editing quality and inference speed.
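To make the mask-driven idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a MultiDiffusion-style fusion in which each masked region is denoised under its own prompt and the per-region latents are combined by a mask-weighted average, with the original (background) latent filling pixels that no mask covers. The function names `fuse_region_latents` and `background_preservation_loss`, the `(C, H, W)` latent layout, and the mean-squared-error form of the loss are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def fuse_region_latents(background, region_latents, masks, eps=1e-8):
    """Mask-weighted average of per-region latents (illustrative assumption).

    background:     (C, H, W) latent of the unedited image
    region_latents: list of (C, H, W) latents, one per prompt/mask pair
    masks:          list of (H, W) arrays with values in [0, 1]
    """
    background = np.asarray(background, dtype=np.float64)
    weighted_sum = np.zeros_like(background)
    weight_total = np.zeros(background.shape[-2:], dtype=np.float64)

    for latent, mask in zip(region_latents, masks):
        m = np.asarray(mask, dtype=np.float64)
        # The (H, W) mask broadcasts across the channel axis.
        weighted_sum += np.asarray(latent, dtype=np.float64) * m
        weight_total += m

    covered = weight_total > 0
    averaged = weighted_sum / np.maximum(weight_total, eps)
    # Pixels outside every mask keep the background latent unchanged.
    return np.where(covered, averaged, background)


def background_preservation_loss(edited, reference, masks):
    """Mean squared latent difference over the background only (assumed form)."""
    edited = np.asarray(edited, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Background is wherever no foreground mask is active.
    bg = 1.0 - np.clip(np.sum(masks, axis=0), 0.0, 1.0)
    squared_error = (edited - reference) ** 2 * bg
    n_elements = max(bg.sum() * edited.shape[0], 1.0)
    return squared_error.sum() / n_elements
```

In this sketch, overlapping masks simply average their region latents, and the loss is exactly zero when the fused latent matches the reference everywhere outside the masks. A real pipeline would apply such a fusion at every denoising step and would also need the cross-attention term mentioned in the abstract, which is not modeled here.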
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Localized multi-object editing contributes to multimedia and multimodal processing by enabling efficient manipulation of multiple objects within the same context. This capability enhances creativity and productivity, allowing users to make complex edits quickly and experiment with different compositions. It promotes the integration of various modalities, ensuring coherence in multimedia presentations, and it fosters personalized user experiences by letting users customize content according to their preferences. In these ways it advances multimedia and multimodal processing and helps democratize high-quality editing. As the technology evolves, its role in various applications and industries is likely to expand further.
Supplementary Material: zip
Submission Number: 2989