Abstract: Story visualization aims to create a series of visually consistent images from a written story. With the advancement of large models, visualization techniques have progressed from generating characters limited to predefined datasets to creating open-ended stories. Although these techniques have achieved impressive results, several challenges remain. In particular, maintaining character consistency in scenes with multiple characters and transforming story text into information usable for accurate image generation are still difficult. To address these challenges, we propose StoryWeaver, a framework designed for consistent, multi-character, and open-ended story generation. StoryWeaver incorporates trainable prompt templates, allowing a large language model to decompose the story text into the required number of images together with the corresponding prompts, character descriptions, and story captions for each image. This process ensures that the key elements of the story are accurately translated into visual content. During image generation, we propose a method to ensure character consistency, especially in multi-character scenes. This is achieved by extracting image embeddings from the self-attention layers and converting character prompts into text embeddings using a tokenizer. These embeddings are injected into the target image generation process and applied to both the self-attention and cross-attention layers in a region-specific manner, which maintains consistency for each character while preventing interference between characters. Users can also supply pose images or other control inputs, which are seamlessly integrated into the image generation pipeline. Experimental results show that StoryWeaver outperforms existing methods, particularly in maintaining consistency across characters.
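To make the region-specific injection idea concrete, below is a minimal PyTorch sketch, not the authors' implementation, of cross-attention where each image token attends only to the text embedding of the character occupying its region. All names (`region_masked_cross_attention`, the mask layout) are hypothetical; key/value projection matrices and the self-attention branch described in the abstract are omitted for brevity.

```python
import torch

def region_masked_cross_attention(image_tokens, char_text_embs, region_masks):
    """Illustrative region-restricted cross-attention (assumed interface).

    image_tokens:   (B, N, C) flattened image/latent tokens (queries).
    char_text_embs: list of (B, T_k, C) text embeddings, one per character.
    region_masks:   list of (B, N) boolean masks, one per character
                    (True where an image token lies in that character's region).
    """
    B, N, C = image_tokens.shape
    # Concatenate all characters' embeddings into one key/value sequence.
    keys = torch.cat(char_text_embs, dim=1)                     # (B, T_total, C)

    # Additive attention bias: -inf blocks attention across character regions.
    bias = torch.full((B, N, keys.shape[1]), float("-inf"))
    zero, neg_inf = torch.zeros(1), torch.full((1,), float("-inf"))
    offset = 0
    for emb, mask in zip(char_text_embs, region_masks):
        t_k = emb.shape[1]
        # Only tokens inside this character's region may attend to its embedding.
        bias[:, :, offset:offset + t_k] = torch.where(mask.unsqueeze(-1), zero, neg_inf)
        offset += t_k

    # Scaled dot-product attention with the region bias (projections omitted).
    logits = image_tokens @ keys.transpose(-1, -2) / C ** 0.5 + bias
    attn = torch.softmax(logits, dim=-1)
    out = attn @ keys

    # Background tokens (in no character region) get all -inf logits and NaN
    # softmax rows; zero them out so only character regions receive injection.
    covered = torch.stack(region_masks).any(dim=0)              # (B, N)
    return torch.where(covered.unsqueeze(-1), out, torch.zeros_like(out))

# Toy usage: two characters splitting a 16-token canvas left/right.
B, N, C = 1, 16, 8
img = torch.randn(B, N, C)
embs = [torch.randn(B, 4, C), torch.randn(B, 5, C)]
masks = [torch.zeros(B, N, dtype=torch.bool) for _ in range(2)]
masks[0][:, :8] = True
masks[1][:, 8:] = True
out = region_masked_cross_attention(img, embs, masks)           # (1, 16, 8)
```

The design point this sketch illustrates is that masking at the attention-bias level confines each character's conditioning to its own spatial region, which is one plausible way to realize the "prevents interference between characters" property claimed in the abstract.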
External IDs: dblp:conf/icic/ChengYWLX25