Keywords: text guided image editing, agentic image editing, diffusion model
Abstract: Text-guided image editing with diffusion models struggles to maintain fidelity during complex, multi-aspect edits, which require changing some attributes while preserving others. One-shot prompt rewriting offers some improvement but lacks fine-grained control, often leading to under-editing of desired attributes, over-editing of unrelated regions, or both. To address these gaps, we introduce `ARTIE` (**A**uditable **R**efinement for **T**ext-Guided **I**mage **E**diting), a plug-and-play, inference-time, feedback-based agentic system that enhances pre-trained diffusion models such as Stable Diffusion. `ARTIE` is organized into three agentic sub-modules: (1) a perception module (`SceneDiff`), which detects over-editing by comparing source and target scene graphs, and detects under-editing by grounding edit requirements through a dual-verification pipeline comprising an open-set object detector and CLIP; (2) a reasoning/planner module (an LLM-based Prompt Engineer), which takes the diagnostic signals from `SceneDiff` and synthesizes refined positive prompts together with asymmetrically weighted negative prompts; and (3) an action module (the image generator), which executes these refined prompts to produce an improved image. This perception–reasoning–action loop runs for multiple cycles, refining the edited image at each iteration. The loop also yields an auditable trail of refinement steps in which each modification is explained and justified by explicit feedback signals from the perception module. `ARTIE` operates solely through guided prompt engineering and requires no model retraining or fine-tuning, which is what makes it plug-and-play. Despite being training-free, when applied on top of Stable Diffusion, `ARTIE` consistently improves fidelity and control in multi-aspect editing. Its performance matches or surpasses specialized baselines, thereby setting a new state-of-the-art for explainable, inference-time agentic image editing.
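The perception–reasoning–action loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Diagnosis`, `Step`, `refine`, `perceive`, `plan`, and `generate` are hypothetical, the three modules are passed in as plain callables, and the asymmetric weighting of negative prompts is not modelled.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Diagnosis:
    """Hypothetical perception output: what is still missing, what changed by mistake."""
    under_edited: List[str]
    over_edited: List[str]


@dataclass
class Step:
    """One entry of the audit trail: the prompts used and the feedback they received."""
    iteration: int
    prompt: str
    negative_prompt: str
    diagnosis: Diagnosis


def refine(
    source: Any,
    instruction: str,
    perceive: Callable[[Any, Any, str], Diagnosis],       # stands in for the perception module
    plan: Callable[[str, Diagnosis], Tuple[str, str]],    # stands in for the LLM planner
    generate: Callable[[Any, str, str], Any],             # stands in for the image generator
    max_iters: int = 3,
) -> Tuple[Any, List[Step]]:
    """Run generate -> perceive -> plan until the diagnosis is clean or the budget ends."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    prompt, negative = instruction, ""
    trail: List[Step] = []
    image = None
    for i in range(max_iters):
        image = generate(source, prompt, negative)        # action
        diag = perceive(source, image, instruction)       # perception
        trail.append(Step(i, prompt, negative, diag))     # record for auditing
        if not diag.under_edited and not diag.over_edited:
            break                                         # nothing left to fix
        prompt, negative = plan(instruction, diag)        # reasoning
    return image, trail
```

In this sketch the returned `trail` plays the role of the auditable record: each entry pairs the prompts that were executed with the diagnostic signals that motivated the next revision.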
Primary Area: generative models
Submission Number: 9021