Keywords: Applications of modularity, Text-to-image generation
TL;DR: We teach LLMs to compose multi-model text-to-image workflows from community-trained, specialized models.
Abstract: The practical use of text-to-image generation has evolved from simple, monolithic models to complex workflows that combine multiple specialized components. These components are independently trained by different practitioners to excel at specific tasks – from improving photorealism or anime-style generation to fixing common artifacts like malformed hands. Crafting effective workflows from these components requires significant expertise, owing to the large number of available models and their complex interdependencies. We introduce prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt by intelligently selecting and combining these specialized components. We propose two LLM-based approaches: a tuning-based method and an in-context method. Both lead to improved image quality compared to monolithic models or generic workflows, demonstrating that prompt-dependent workflow prediction offers a new pathway to improving text-to-image generation.
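To make the in-context approach concrete, below is a minimal Python sketch of how an LLM might be prompted to select a workflow per user prompt. Everything here is an illustrative assumption, not the paper's implementation: the component names in `COMPONENT_CATALOG`, the JSON workflow schema, and the `call_llm` helper (a stand-in for any chat-completion API) are all hypothetical.

```python
import json

# Hypothetical catalog of community-trained components; real systems would list
# actual checkpoints, LoRAs, and artifact fixers. Names are illustrative only.
COMPONENT_CATALOG = {
    "base_models": ["photoreal-base", "anime-base"],
    "loras": ["detail-boost", "hand-fix"],
    "upscalers": ["latent-upscale", "none"],
}

def build_selection_prompt(user_prompt: str) -> str:
    """Assemble an in-context instruction asking an LLM to pick a workflow."""
    return (
        "You compose text-to-image workflows from the components below.\n"
        f"Components: {json.dumps(COMPONENT_CATALOG)}\n"
        f"User prompt: {user_prompt!r}\n"
        "Return JSON with keys: base_model, loras, upscaler."
    )

def parse_workflow(llm_output: str) -> dict:
    """Parse the LLM's JSON reply into a workflow spec, with a safe fallback."""
    try:
        return json.loads(llm_output)
    except json.JSONDecodeError:
        return {"base_model": "photoreal-base", "loras": [], "upscaler": "none"}

if __name__ == "__main__":
    prompt = build_selection_prompt("a watercolor fox in a misty forest")
    # `call_llm` stands in for whatever LLM API is available; it is assumed here.
    # reply = call_llm(prompt)
    reply = '{"base_model": "anime-base", "loras": ["detail-boost"], "upscaler": "none"}'
    print(parse_workflow(reply))
```

The returned spec would then be translated into an executable generation pipeline; the tuning-based variant would instead fine-tune the LLM to emit such workflows directly.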
Submission Number: 14