Preference Adaptive and Sequential Text-to-Image Generation

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: We introduce PASTA, a multimodal-LLM-based RL agent for sequential, adaptive text-to-image generation. PASTA improves alignment with user intent via iterative prompt expansions. We also release a new dataset for sequential text-to-image generation.
Abstract: We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.
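The abstract's user-preference modeling via an EM strategy over varying user types can be illustrated with a small sketch: a mixture model in which each simulated user has a latent preference type, choices over a slate of images follow a softmax over per-type preference weights, the E-step assigns type responsibilities, and the M-step updates mixing weights in closed form plus the weights by a gradient step. All dimensions, the feature-based softmax choice model, and every variable name below are illustrative assumptions, not the paper's actual models or dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generator (illustrative only): each user has a hidden preference type t;
# on every round they pick one of K slate images, choosing image k with
# probability softmax_k(w_t . x_k) over the images' feature vectors.
T, d, K, U, N = 3, 4, 5, 80, 25      # types, feature dim, slate size, users, rounds
true_w = 2.0 * rng.normal(size=(T, d))
z = rng.integers(T, size=U)          # latent type per simulated user
X = rng.normal(size=(U, N, K, d))    # image features shown in each round
util = np.einsum('unkd,ud->unk', X, true_w[z])
y = (util + rng.gumbel(size=util.shape)).argmax(-1)  # observed choices (Gumbel-max)

def choice_loglik(w):
    """(T, U) log-likelihood of each user's choice sequence under each type."""
    u = np.einsum('unkd,td->tunk', X, w)
    m = u.max(-1, keepdims=True)
    lse = m[..., 0] + np.log(np.exp(u - m).sum(-1))   # stable log-normalizer
    chosen = np.take_along_axis(u, y[None, ..., None], -1)[..., 0]
    return (chosen - lse).sum(-1)

w = rng.normal(size=(T, d))          # per-type preference weights (to be learned)
pi = np.full(T, 1.0 / T)             # mixing weights over user types

def mixture_loglik():
    ll = choice_loglik(w) + np.log(pi)[:, None]
    m = ll.max(0)
    return (m + np.log(np.exp(ll - m).sum(0))).sum()

before = mixture_loglik()
for _ in range(50):
    # E-step: responsibility r[t, u] of type t for user u.
    ll = choice_loglik(w) + np.log(pi)[:, None]
    r = np.exp(ll - ll.max(0))
    r /= r.sum(0)
    # M-step: closed-form mixing weights; one gradient step on w
    # (generalized EM), normalized per responsible observation.
    pi = r.mean(1)
    u = np.einsum('unkd,td->tunk', X, w)
    p = np.exp(u - u.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    diff = np.eye(K)[y][None] - p                     # (T, U, N, K)
    grad = np.einsum('tu,tunk,unkd->td', r, diff, X)
    w += 0.2 * grad / (N * r.sum(1))[:, None]
after = mixture_loglik()
print(f"mixture log-likelihood: {before:.1f} -> {after:.1f}")
```

Running the loop improves the mixture log-likelihood, and the responsibilities `r` give soft assignments of users to preference types, the kind of signal a slate-selection policy could condition on.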
Lay Summary: Have you ever tried to create an image using AI from a text description, only for the result to not quite match what you had in mind? It's often difficult to perfectly convey complex or evolving artistic visions with a single instruction, leading to a frustrating trial-and-error process.

We've developed an AI assistant called PASTA that learns your preferences through a more conversational approach to image generation. Instead of just one attempt, PASTA shows you several image options based on your initial idea. You then pick the images you like best, and PASTA uses this feedback to refine its suggestions over several turns, guiding the image generation closer to your desired outcome. To build this, we collected new data on how people make these sequential choices and even created simulated users to help train our AI.

This research makes image generation with AI a more collaborative and intuitive experience. It allows users to better express their specific ideas, helping them bring complex or abstract visions to life more effectively. Ultimately, this work aims to make AI image generation tools more satisfying and better aligned with individual user intent, and we're sharing our data to help other researchers build even more advanced creative AI.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: text-to-image generation, reinforcement learning, diffusion models, multi-modal large language models, human-in-the-loop, sequential decision making
Submission Number: 10020