DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Abstract: Existing story visualization systems often rely on text-only control, which makes it difficult to decide where multiple characters should appear and to keep their visual appearance consistent across panels. DreamingComics addresses this by introducing a layout-aware framework that jointly reasons about subject identity, spatial layout, and artistic style for comic-style stories.
We build on a pretrained video diffusion transformer and repurpose it for image customization, leveraging its spatiotemporal priors to improve identity and style consistency across generated panels. To control spatial layout, we introduce RegionalRoPE, a region-aware rotary position embedding that re-indexes reference tokens according to target bounding boxes, and a masked condition loss that penalizes attention that leaks outside the designated regions. Complementing this, an LLM-based layout generator is fine-tuned on comic layout data to predict panel and character boxes directly from textual scripts, reducing the need for manual layout design.
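The two layout-control components above can be illustrated with a minimal sketch. Note that `regional_reindex` and `masked_condition_loss` are hypothetical names introduced here for illustration; the paper's actual indexing scheme and loss formulation may differ. The idea is that reference-image tokens are assigned 2D rotary-embedding positions stretched into the target bounding box, and attention mass landing outside the box is penalized:

```python
import numpy as np

def regional_reindex(ref_h, ref_w, box):
    """Map a ref_h x ref_w reference-token grid into a target bounding box.

    box is (top, left, bottom, right) in target token coordinates.
    Returns (ref_h * ref_w, 2) array of (row, col) positions to use when
    computing the 2D rotary position embedding of the reference tokens,
    so the reference content is aligned with its designated region.
    """
    top, left, bottom, right = box
    rows = np.linspace(top, bottom, ref_h)   # stretch reference rows into the box
    cols = np.linspace(left, right, ref_w)   # stretch reference cols into the box
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.stack([rr.ravel(), cc.ravel()], axis=-1)

def masked_condition_loss(attn, region_mask):
    """Penalize attention that leaks outside the designated region.

    attn:        (num_ref_tokens, num_target_tokens) attention weights
    region_mask: (num_target_tokens,) with 1 inside the box, 0 outside
    """
    leak = attn * (1.0 - region_mask)        # attention falling outside the box
    return leak.sum(axis=-1).mean()
```

Under this sketch, a character whose reference tokens are re-indexed into its box attends (and is attended to) as if it already occupied that region, while the loss discourages its influence from bleeding into other panels or characters.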
On benchmarks such as ViStoryBench and DreamBench++, DreamingComics improves character consistency and style similarity over prior story visualization and image customization methods, while maintaining high spatial accuracy with respect to the specified layouts.