Keywords: Image editing, text-in-image, multi-modal agent, A* search, cost-sensitive, vision-language models
TL;DR: A cost-sensitive tool-calling agent that finds efficient multimodal tool-use plans with different quality-cost trade-offs for image and text-in-image editing tasks that challenge existing models and agents.
Abstract: Text-to-image models like Stable Diffusion and DALL-E 3 still struggle with complex multi-turn image editing. We study how to break such a task down into a sequence of subtasks and address them with an agentic workflow (path) of AI tool use at minimal cost.
Conventional search algorithms require expensive exploration to find tool paths. While large language models (LLMs) possess prior knowledge for subtask planning, their estimates of tool quality and cost are usually too inaccurate to determine which tool to apply to each subtask. $\textit{Can we combine the strengths of both LLMs and graph search to find cost-efficient tool paths?}$ We propose a three-stage approach, ``CoSTA*'', that leverages LLMs to create a subtask tree that prunes a graph of AI tools for the given task, and then conducts A* search on the small subgraph to find a tool path. To better balance total cost and quality, CoSTA* combines both metrics for each tool on every subtask to guide the A* search. Each subtask's output is evaluated by a vision-language model (VLM), and a failure triggers an update of the tool's cost and quality on that subtask. Hence, the A* search can recover from failures quickly and explore other paths. Moreover, CoSTA* can automatically switch between modalities across subtasks for a better cost-quality trade-off. We build a novel benchmark of challenging multi-turn image editing, on which CoSTA* outperforms state-of-the-art image-editing models and agents in both cost and quality, and supports versatile trade-offs according to user preference. Our dataset and a hosted demo can be found at https://storage.googleapis.com/costa-frontend/index.html.
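The sketch below illustrates, in a simplified form, the kind of cost-quality weighted A* search over a pruned tool subgraph that the abstract describes, including a feedback-driven penalty update after a simulated VLM failure. It is not the authors' implementation: the tool names, the toy cost/quality numbers, the weighting formula, and the penalty rule are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): cost-quality weighted A* over a
# pruned tool graph, with a penalty update when a subtask output is judged a failure.
import heapq

# Hypothetical pruned subgraph: each subtask maps to candidate tools with
# (execution_cost, expected_quality in [0, 1]). All values are made up.
SUBTASKS = ["detect_text", "erase_text", "inpaint_background", "render_new_text"]
TOOLS = {
    "detect_text":        {"ocr_a": (1.0, 0.95), "ocr_b": (0.4, 0.80)},
    "erase_text":         {"eraser_a": (2.0, 0.90), "eraser_b": (0.8, 0.70)},
    "inpaint_background": {"sd_inpaint": (3.0, 0.92), "patchmatch": (0.5, 0.65)},
    "render_new_text":    {"text_diffusion": (2.5, 0.88), "pillow_draw": (0.2, 0.60)},
}
ALPHA = 0.5  # user-chosen cost-vs-quality trade-off (assumed functional form below)


def edge_weight(cost, quality, alpha=ALPHA):
    """Combine cost and quality into a single edge weight (lower is better)."""
    return alpha * cost + (1 - alpha) * (1 - quality)


def heuristic(subtask_idx):
    """Optimistic estimate: sum of the cheapest remaining edge weights."""
    return sum(
        min(edge_weight(c, q) for c, q in TOOLS[s].values())
        for s in SUBTASKS[subtask_idx:]
    )


def a_star_tool_path():
    """Find the tool sequence (one tool per subtask) with minimal total weight."""
    # Frontier entries: (f = g + h, g, next subtask index, tool path so far)
    frontier = [(heuristic(0), 0.0, 0, [])]
    while frontier:
        _, g, idx, path = heapq.heappop(frontier)
        if idx == len(SUBTASKS):
            return path, g
        for tool, (cost, quality) in TOOLS[SUBTASKS[idx]].items():
            g2 = g + edge_weight(cost, quality)
            heapq.heappush(
                frontier,
                (g2 + heuristic(idx + 1), g2, idx + 1, path + [(SUBTASKS[idx], tool)]),
            )
    return None, float("inf")


def penalize(subtask, tool, cost_factor=1.5, quality_drop=0.2):
    """After a (simulated) VLM-detected failure, inflate the tool's cost and lower
    its quality on that subtask so the next search pass prefers alternatives."""
    cost, quality = TOOLS[subtask][tool]
    TOOLS[subtask][tool] = (cost * cost_factor, max(0.0, quality - quality_drop))


if __name__ == "__main__":
    path, total = a_star_tool_path()
    print("initial plan:", path, "weight=%.2f" % total)
    # Pretend the VLM judged the erase step a failure; update and re-plan.
    penalize("erase_text", path[1][1])
    path, total = a_star_tool_path()
    print("re-planned:  ", path, "weight=%.2f" % total)
```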
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14594