TL;DR: Proactive agents for multi-turn, uncertainty-aware text-to-image generation, with an interface that asks clarification questions when uncertain and presents agent beliefs so users can edit them directly.
Abstract: User prompts for generative AI models are often underspecified, leading to a misalignment between the user's intent and the model's understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents: one has access to a ground-truth intent (an image), while the other tries to ask as few questions as possible to align with that ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014), and DesignBench, a benchmark we curated with strong artistic and design elements. Experiments over the three datasets demonstrate the proposed T2I agents' ability to ask informative questions and elicit crucial information to achieve successful alignment, with at least 2 times higher VQAScore (Lin et al., 2024) than standard T2I generation. Moreover, we conducted human studies and observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow, highlighting the effectiveness of our approach. Code and DesignBench can be found at https://github.com/google-deepmind/proactive_t2i_agents.
Lay Summary: When you ask an AI to create an image, it often gets it wrong because your description isn't specific enough. This means you have to keep trying different prompts, which can be frustrating.
For example, if you ask for “an image of a dog in a park,” the AI has to guess things like the type of dog you want or what time of day the picture should show. Instead of making you rewrite the prompt over and over, our paper proposes an AI that proactively asks questions to clarify aspects of the image before generating it. For example, the AI would ask which type of dog you want shown.
In addition to asking questions, the AI we propose also shows you a simple, editable chart of its understanding of the image. We call this chart a “belief graph”: it lists the elements of the image, so you can directly correct the AI’s understanding before the image is generated.
Tests show that this new method is at least twice as effective at creating the image the user actually wants. Furthermore, when real people tried it, over 90% said the AI's questions and the belief graph were helpful for getting the image they imagined. In short, it's like having a helpful conversation with an AI artist instead of just giving it commands and hoping for the best.
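To make the belief-graph idea concrete, here is a minimal illustrative sketch (not the paper's actual implementation; all class and attribute names are hypothetical): the agent tracks a probability distribution over each unknown attribute, asks about the most uncertain one, and a user edit collapses that belief to a single value.

```python
import math
from dataclasses import dataclass, field

@dataclass
class AttributeBelief:
    # Agent's probability distribution over possible values of one attribute,
    # e.g. {"golden retriever": 0.4, "poodle": 0.35, "corgi": 0.25}.
    name: str
    distribution: dict = field(default_factory=dict)

    def entropy(self) -> float:
        # Shannon entropy: higher means the agent is less sure of this attribute.
        return -sum(p * math.log2(p) for p in self.distribution.values() if p > 0)

    def resolve(self, value: str) -> None:
        # A direct user edit collapses the belief to a single certain value.
        self.distribution = {value: 1.0}

@dataclass
class BeliefGraph:
    # Entity name -> list of attribute beliefs about that entity.
    entities: dict = field(default_factory=dict)

    def most_uncertain(self) -> AttributeBelief:
        # A simple question-selection policy: ask about the highest-entropy attribute.
        return max(
            (attr for attrs in self.entities.values() for attr in attrs),
            key=lambda attr: attr.entropy(),
        )

graph = BeliefGraph(entities={
    "dog": [AttributeBelief("breed",
            {"golden retriever": 0.4, "poodle": 0.35, "corgi": 0.25})],
    "scene": [AttributeBelief("time of day", {"day": 0.9, "night": 0.1})],
})

target = graph.most_uncertain()  # "breed" has higher entropy than "time of day"
target.resolve("corgi")          # user edits the graph instead of answering a question
```

This sketch only illustrates the interface concept from the summary above; the actual agents in the paper maintain and update their beliefs with language models.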
Link To Code: https://github.com/google-deepmind/proactive_t2i_agents
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Prompt Underspecification, Interpretability, Explainable AI, Dialog, Human-AI-Interaction, Agents
Flagged For Ethics Review: true
Submission Number: 12385