Keywords: GUI Agent Benchmark, Multimodal large language model, Image editing
Abstract: Photoshop is a powerful and widely used professional application for image editing, design, and creative production. Its complex multi-level menu structure, extensive set of graphical processing tools, and reliance on precise manipulation make automated operation and agent interaction particularly challenging. Despite recent progress in GUI agents, existing datasets and methods are designed primarily for web-based interfaces and for short-horizon, low-complexity operating-system tasks, and they fall short of the fine-grained control, multi-step decision-making, and semantic understanding that professional graphics software requires. To this end, we propose PSBench, the first benchmark specifically tailored to image editing in the Adobe Photoshop environment, with a particular focus on its core principle of non-destructive editing through layers. The benchmark consists of 600 human-annotated tasks spanning three difficulty levels. Easy and medium tasks are distilled from official Photoshop tutorials and capture the most common basic operations. Hard tasks are collected directly from the most popular Photoshop tutorials on YouTube, ensuring both challenge and real-world relevance. Task categories cover fundamental functionality such as canvas adjustment, layer manipulation, and filter application, each accompanied by dedicated fine-grained evaluation metrics. Through experiments on PSBench, we find that current leading MLLMs, such as Qwen2.5-VL, GPT-5, and Gemini-2.5-Pro, exhibit generally low task success rates yet demonstrate remarkable planning ability. Furthermore, in a human-in-the-loop experiment, we find that MLLMs can serve as highly effective Photoshop assistants, substantially boosting novice users’ task success rates while dramatically reducing their operation time.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1921