Keywords: Text-to-image generation, Multi-agent, Image editing, Multimodality
TL;DR: We propose Mac-Tiger, a multi-agent cooperative framework leveraging MLLMs to iteratively refine text-to-image generation, improving semantic consistency and visual coherence in complex compositional tasks.
Abstract: Recent advancements in text-to-image (T2I) generation have significantly improved image fidelity and alignment with textual prompts, yet challenges remain in addressing complex compositional requirements, such as attribute binding, spatial relationships, and numerical precision. To tackle these issues, this paper introduces Mac-Tiger, a novel multi-agent cooperative framework that leverages multimodal large language models (MLLMs) to optimize T2I generation through iterative refinement. Unlike traditional single-agent approaches, Mac-Tiger employs a tri-agent system—comprising Reviewer, Challenger, and Refiner roles—that collaboratively evaluates and refines prompts based on dynamically generated feedback and multimodal analysis. Key innovations include integrating advanced modules for perception, memory, and cooperative planning to facilitate adaptive prompt optimization. Experiments on benchmarks like T2I-CompBench and MagicBrush demonstrate Mac-Tiger’s superior performance in generating semantically consistent and visually coherent images, particularly in scenarios involving intricate object interactions and detailed edits. This work underscores the potential of multi-agent systems to address long-standing limitations in T2I generation, paving the way for more robust and context-aware generative models.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18695