Visual Prompting with Iterative Refinement for Design Critique Generation

Peitong Duan; Chin-Yi Cheng; Bjoern Hartmann; Yang Li

16 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: User Interface Design Critique, Multimodal LLM, Visual Grounding, Prompting Techniques, VLM

TL;DR: We propose a VLM-based framework that iteratively refines text and visual grounding to produce grounded user interface design critiques, outperforming baselines and generalizing to other multimodal tasks.

Abstract: Feedback is essential in all design processes, including user interface (UI) design. Automating design critiques can significantly improve the efficiency of design workflows. Although existing vision language models (VLMs) excel at many tasks, they often struggle to generate high-quality design critiques, a complex task that requires producing detailed design comments that are visually grounded in an image of the design. Building on recent advances in iterative refinement of text output and in visual prompting methods, we propose a multimodal iterative refinement and visual prompting framework for UI critique. The framework takes a UI screenshot and design guidelines as input and generates a list of design comments, along with bounding boxes that map each comment to a specific region of the screenshot. The entire process is driven by VLMs, which iteratively refine both the text output and the bounding boxes in a mutually conditioned manner, using few-shot samples tailored to each step. We evaluated our approach using Gemini-1.5-pro and GPT-4o and found that human experts generally preferred the design critiques generated by our pipeline over those generated by the baseline, with the pipeline reducing the gap from human performance by 50% for one rating metric. To assess generalizability to other multimodal tasks, we applied our pipeline to open-vocabulary object and attribute detection, and experiments showed that our method also outperformed the baseline.
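The mutually conditioned refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`generate_comments`, `refine_text`, `refine_box`), the `rounds` parameter, the normalized box convention, and the whole-screen starting box are all assumptions made for illustration. The three callables stand in for VLM calls with step-specific few-shot prompts.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical convention: normalized (x_min, y_min, x_max, y_max).
BBox = Tuple[float, float, float, float]


@dataclass
class Comment:
    text: str
    bbox: BBox


def refine_critique(
    screenshot: Any,
    guidelines: str,
    generate_comments: Callable[[Any, str], List[str]],
    refine_text: Callable[[Any, str, str, BBox], str],
    refine_box: Callable[[Any, str, BBox], BBox],
    rounds: int = 2,
) -> List[Comment]:
    """Alternate between box and text refinement for each design comment.

    All three callables are placeholders for VLM calls (assumed, not from
    the paper); `rounds` is an illustrative stopping rule.
    """
    # Initial text-only critique; each comment starts with a whole-screen box.
    texts = generate_comments(screenshot, guidelines)
    comments = [Comment(t, (0.0, 0.0, 1.0, 1.0)) for t in texts]

    for _ in range(rounds):
        for c in comments:
            # Refine the box conditioned on the current comment text...
            c.bbox = refine_box(screenshot, c.text, c.bbox)
            # ...then refine the text conditioned on the updated box.
            c.text = refine_text(screenshot, guidelines, c.text, c.bbox)
    return comments
```

Passing the model calls in as plain callables keeps the sketch independent of any particular VLM client, so the control flow can be exercised with simple stubs.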

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 6790
