VersaFusion: A Versatile Diffusion-Based Framework for Fine-Grained Image Editing and Enhancement

Published: 01 Jan 2025 · Last Modified: 30 Jul 2025 · AAAI 2025 · CC BY-SA 4.0
Abstract: Text-to-image (T2I) diffusion models have achieved remarkable progress in generating realistic images from textual descriptions. However, consistently producing high-quality images with complete backgrounds, faithful object appearance, and well-rendered textures remains challenging. This paper presents a fine-grained, pixel-level image editing method built on pre-trained diffusion models. The proposed dual-branch architecture, consisting of a Guidance branch and a Generation branch, employs U-Net denoisers and self-attention mechanisms. An improved DDIM-like inversion first obtains the latent representation of the input image, which is then refined over multiple denoising steps. Cross-branch interactions, including KV Replacement, Classifier Guidance, and Feature Correspondence, enable precise control while preserving image fidelity. The iterative refinement and reconstruction process supports fine-grained editing, including attribute modification, image outpainting, style transfer, and face synthesis with click-and-drag editing guided by masks. Experimental results demonstrate that the proposed approach improves the quality and controllability of T2I-generated images, surpassing existing methods while keeping computational cost low enough for practical real-world applications.
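As a rough illustration of the cross-branch KV Replacement mentioned in the abstract, the sketch below shows a generation-branch self-attention step whose keys and values are swapped for those cached from the guidance branch, so the edited output stays anchored to the source image. All names, tensor shapes, and the standalone attention helper are hypothetical stand-ins for the general technique, not the paper's actual implementation.

```python
# Minimal sketch of cross-branch KV Replacement in self-attention.
# Hypothetical names/shapes; illustrates the general technique only.
import torch


def self_attention(q, k, v):
    """Standard scaled dot-product attention over the token dimension."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v


def kv_replaced_attention(q_gen, k_gen, v_gen, k_guide, v_guide, replace=True):
    """Generation-branch queries attend to guidance-branch keys/values,
    keeping edited content aligned with the source image layout."""
    if replace:
        return self_attention(q_gen, k_guide, v_guide)
    return self_attention(q_gen, k_gen, v_gen)


if __name__ == "__main__":
    b, tokens, dim = 1, 64, 320              # toy latent size / channel width
    q_gen = torch.randn(b, tokens, dim)       # queries from the generation branch
    k_gen = torch.randn(b, tokens, dim)
    v_gen = torch.randn(b, tokens, dim)
    k_guide = torch.randn(b, tokens, dim)     # keys/values cached from the guidance branch
    v_guide = torch.randn(b, tokens, dim)
    out = kv_replaced_attention(q_gen, k_gen, v_gen, k_guide, v_guide)
    print(out.shape)  # torch.Size([1, 64, 320])
```

In practice such a swap is typically applied only at selected denoising steps and U-Net layers; applying it everywhere tends to suppress the edit entirely.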