PhotoAgent: Exploratory Visual Aesthetic Planning with Large Vision Models

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: image enhancement, image composition, image editing
Abstract: With the rapid recent development of generative models, instruction-based image editing shows great potential for producing high-quality images. However, editing quality depends heavily on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To remove this burden, we present PhotoAgent, a system that rethinks the paradigm of autonomous image editing. PhotoAgent autonomously reasons about the necessary edits, generates a robust action plan, and executes adjustments through a closed-loop process, all without requiring detailed step-by-step prompt engineering from the user. Our approach integrates large language models for intent reasoning and dynamic planning with a vision-language model for precise localized editing. This combination allows PhotoAgent to interpret users' aesthetic intents, decompose them into executable sub-tasks, and refine the output iteratively based on visual feedback. Extensive experiments demonstrate that PhotoAgent significantly outperforms existing methods in both instruction faithfulness and visual quality across a diverse range of editing scenarios.
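The closed-loop process the abstract describes (reason about intent, plan sub-tasks, execute edits, refine from visual feedback) can be sketched as follows. This is a minimal illustrative skeleton, not PhotoAgent's actual implementation: the `plan`, `execute`, and `critique` functions are hypothetical stand-ins for the LLM planner, VLM editor, and feedback module.

```python
# Hedged sketch of a closed-loop editing agent: plan -> act -> verify -> re-plan.
# All names are illustrative assumptions, not PhotoAgent's real API.

from dataclasses import dataclass, field

@dataclass
class EditState:
    image: str                            # stand-in for pixel data
    history: list = field(default_factory=list)

def plan(intent):
    """LLM stand-in: decompose an aesthetic intent into executable sub-tasks."""
    return {"brighten": ["raise exposure", "lift shadows"],
            "dramatic sky": ["deepen sky contrast", "warm highlights"]}.get(intent, [intent])

def execute(state, task):
    """VLM editor stand-in: apply one localized edit and record it."""
    state.image += f"+{task}"
    state.history.append(task)
    return state

def critique(state, intent):
    """Visual-feedback stand-in: approve once every planned edit is applied."""
    return all(t in state.history for t in plan(intent))

def photo_agent(image, intent, max_rounds=3):
    state = EditState(image)
    for _ in range(max_rounds):           # closed loop with iterative refinement
        for task in plan(intent):
            if task not in state.history:
                state = execute(state, task)
        if critique(state, intent):       # visual feedback gates termination
            break
    return state

result = photo_agent("IMG_0001", "brighten")
```

The key design point mirrored here is that the user supplies only a high-level intent (`"brighten"`); decomposition, sequencing, and verification happen inside the loop rather than in the prompt.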
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4224