A Plug-and-Play Agentic Framework for Text Guided Image Editing

Dibyanayan Bandyopadhyay; Arnab Kumar Mondal; Saswati Dana; Udit Sharma; Prathosh AP; Dinesh Garg; Amith Singhee

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0

Keywords: text guided image editing, agentic image editing, diffusion model

Abstract: Text-guided image editing with diffusion models struggles to maintain fidelity during complex, multi-aspect edits, where some attributes must change while others must be preserved. One-shot prompt rewriting offers some improvement, but it lacks fine-grained control and often leads to under-editing of desired attributes, over-editing of unrelated regions, or both. To address these gaps, we introduce `ARTIE` (**A**uditable **R**efinement for **T**ext-Guided **I**mage **E**diting), a plug-and-play, inference-time, feedback-based agentic system that enhances pre-trained diffusion models such as Stable Diffusion. `ARTIE` is organized into three agentic sub-modules: (1) a perception module (`SceneDiff`), which detects over-editing by comparing source and target scene graphs, and under-editing by grounding edit requirements through a dual-verification pipeline comprising an open-set object detector and CLIP; (2) a reasoning/planner module (an LLM-based Prompt Engineer), which takes the diagnostic signals from `SceneDiff` and synthesizes refined positive prompts together with asymmetrically weighted negative prompts; and (3) an action module (the image generator), which executes these refined prompts to produce improved images. This perception–reasoning–action loop runs for multiple cycles, iteratively improving the edited image. Because each cycle is driven by explicit feedback signals from the perception module, `ARTIE` also yields an auditable trail of refinement steps in which each modification is explained and justified. Further, `ARTIE` operates solely through guided prompt engineering, without requiring model retraining or fine-tuning, which makes it a plug-and-play architecture. Despite being training-free, when applied on top of Stable Diffusion, `ARTIE` consistently improves fidelity and control in multi-aspect editing. Its performance matches or surpasses specialized baselines, thereby setting a new state of the art for explainable, inference-time agentic image editing.
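
The perception–reasoning–action loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`Diagnosis`, `Step`, `refine_loop`, the `perceive`, `plan`, and `generate` callables, and the `max_cycles` parameter) is hypothetical, and the stopping rule (stop once no over-edit or under-edit is flagged) is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Diagnosis:
    """Hypothetical container for the perception module's feedback."""
    over_edits: List[str] = field(default_factory=list)   # unintended changes
    under_edits: List[str] = field(default_factory=list)  # requested edits not realized

    @property
    def clean(self) -> bool:
        return not self.over_edits and not self.under_edits


@dataclass
class Step:
    """One entry in the audit trail: the prompts used and the feedback they received."""
    cycle: int
    positive: str
    negative: str
    diagnosis: Diagnosis


def refine_loop(
    source: Any,
    instruction: str,
    perceive: Callable[[Any, Any, str], Diagnosis],
    plan: Callable[[str, Diagnosis], Tuple[str, str]],
    generate: Callable[[Any, str, str], Any],
    max_cycles: int = 3,
) -> Tuple[Any, List[Step]]:
    """Run perceive -> plan -> generate until the diagnosis is clean or cycles run out."""
    positive, negative = instruction, ""
    trail: List[Step] = []
    image = generate(source, positive, negative)  # initial edit attempt
    for cycle in range(max_cycles):
        # Perception: compare the source and the current edit against the instruction.
        diagnosis = perceive(source, image, instruction)
        trail.append(Step(cycle, positive, negative, diagnosis))
        if diagnosis.clean:  # assumed stopping rule
            break
        # Reasoning: turn the diagnosis into refined positive and negative prompts.
        positive, negative = plan(instruction, diagnosis)
        # Action: regenerate the image with the refined prompts.
        image = generate(source, positive, negative)
    return image, trail
```

In this sketch the three callables stand in for `SceneDiff`, the LLM-based Prompt Engineer, and the image generator respectively, and the returned `trail` plays the role of the auditable record: each entry pairs the prompts used in a cycle with the feedback that motivated the next refinement.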

Primary Area: generative models

Submission Number: 9021
