AgentStory: A Multi-Agent System for Story Visualization with Multi-Subject Consistent Text-to-Image Generation
Abstract: Story visualization aims to create visual content, such as images and videos, that is consistent, coherent, and complete with respect to a given story. Despite significant advances in applying diffusion models to general text-to-image generation, these models still struggle to directly produce consistent visual content that accurately aligns with narrative text. In this paper, we propose AgentStory, a novel training-free automated story visualization framework that generates image illustrations from a user-provided story synopsis. Specifically, the framework employs multiple agents empowered by Large Language Models (LLMs) to create detailed descriptions of each subject and scene in the story. It then integrates a masking mechanism with a fine-grained consistency refinement adapter to compose multiple subjects within a scene. Furthermore, it leverages the visual understanding capabilities of multimodal LLMs to feed detailed features of each subject into the refinement adapter, improving the consistency of each subject across multiple scenes. Finally, we compare AgentStory against state-of-the-art story visualization baselines on the DS-500 dataset and demonstrate its superior performance in subject consistency, text-image alignment, and aesthetic quality. Our code is publicly available at https://github.com/tc2000731/AgentStory.
DOI: 10.1145/3731715.3733271
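The abstract outlines a three-stage pipeline: LLM agents expand the synopsis into subject and scene descriptions, a masking mechanism plus a consistency refinement adapter composes subjects into each scene, and a multimodal LLM supplies per-subject visual features to the adapter. The minimal Python sketch below illustrates only that control flow; every class, function, and string in it is a hypothetical placeholder rather than the authors' actual API (see the linked repository for the real implementation), and the generation steps are stubbed out with text stand-ins.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# All names are illustrative placeholders, not the AgentStory API.
from dataclasses import dataclass, field


@dataclass
class Subject:
    name: str
    description: str  # LLM-generated appearance description (stage 1)
    features: dict = field(default_factory=dict)  # MLLM-extracted features (stage 3)


def describe_story(synopsis: str) -> tuple[list[Subject], list[str]]:
    """Stage 1: LLM agents expand the synopsis into per-subject and
    per-scene descriptions. Stubbed here with fixed example outputs."""
    subjects = [Subject("Mia", "a girl in a red coat"),
                Subject("Rex", "a small brown terrier")]
    scenes = ["Mia walks Rex through an autumn park.",
              "Rex chases leaves while Mia laughs."]
    return subjects, scenes


def render_scene(scene: str, subjects: list[Subject]) -> str:
    """Stages 2-3: compose the subjects into one image via per-subject
    masks and a consistency refinement adapter conditioned on MLLM
    features. Returns a text placeholder instead of an actual image."""
    prompt = scene + " | " + "; ".join(s.description for s in subjects)
    return f"<image generated from: {prompt}>"


def agent_story(synopsis: str) -> list[str]:
    """End-to-end: one illustration per scene, with shared subjects."""
    subjects, scenes = describe_story(synopsis)
    return [render_scene(scene, subjects) for scene in scenes]


if __name__ == "__main__":
    for frame in agent_story("A girl and her dog spend a day in the park."):
        print(frame)
```

Note that the same `subjects` list conditions every call to `render_scene`; reusing one set of subject descriptions and features across all scenes is what the abstract's cross-scene consistency claim hinges on in this sketch.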