Keywords: diffusion model, personalized consistent image generation, consistent character generation, manga generation
Abstract: Generating consistent characters across an entire manga page is important yet challenging, as characters must remain coherent under diverse poses, actions, and layouts. Unlike conventional face or human consistency methods that focus on isolated portraits, this broader narrative setting cannot be directly addressed by per-subject fine-tuning or narrowly scoped identity-preservation techniques. We introduce \textbf{MangaCrafter}, a three-stage, training-free framework that achieves layout-aware, multi-character manga generation by altering the denoising process of latent diffusion models. Our key insight is that character consistency can be secured not through persistent identity injection but through staged control of the diffusion trajectory, which front-loads identity anchoring and then gradually relaxes constraints to enable expressive, prompt-driven detail. In \textbf{Stage 1}, \emph{Structural Resonance Injection (SRI)} augments the UNet’s attention with cached reference features to establish structural fidelity in the high-noise regime. The centerpiece of our contribution is \textbf{Stage 2}, where the \emph{Predictive Drift Controller (PDC)}, a proportional-integral-derivative (PID) feedback system, dynamically measures feature drift between the evolving latent and the reference and uses it to modulate the denoising process, preserving identity while suppressing “pasted-on” and “blurry” artifacts. Finally, in \textbf{Stage 3}, we zero out reference injections, leaving identity control to the imprints established in the earlier stages while allowing the model to synthesize fine, prompt-driven details without over-similarity to the reference. Together with a lightweight preprocessing workflow that resolves multi-character fusion, MangaCrafter delivers training-free, consistent yet flexible manga synthesis and suggests a general paradigm for controlled narrative generation across diffusion-based media.
Extensive experiments on the challenging ConsiStory+ benchmark show that our framework achieves state-of-the-art identity preservation while maintaining high prompt fidelity. Ablations confirm the effectiveness of our staged design in balancing consistency, diversity, and aesthetic quality.
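As a rough illustration of the staged control described in the abstract, the sketch below shows how a PID loop could turn a measured feature drift into a reference-injection strength, with full injection early and none at the end. It is not the authors' implementation: the gains, the stage boundaries (`stage1_end`, `stage2_end`), the cosine-based drift measure, and the names `PIDController`, `cosine_drift`, and `injection_strength` are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID loop; the gains are illustrative assumptions."""
    kp: float = 0.5
    ki: float = 0.1
    kd: float = 0.05
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float) -> float:
        # Accumulate the error (I term) and difference it (D term).
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def cosine_drift(latent_feat, ref_feat):
    """Assumed drift measure: 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(latent_feat, ref_feat))
    norm_a = math.sqrt(sum(x * x for x in latent_feat))
    norm_b = math.sqrt(sum(y * y for y in ref_feat))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def injection_strength(step, total_steps, drift, pid,
                       stage1_end=0.3, stage2_end=0.7):
    """Return a reference-injection weight in [0, 1] for one denoising step.

    The stage boundaries are hypothetical fractions of the schedule.
    """
    progress = step / total_steps
    if progress < stage1_end:
        # Stage 1: full reference injection in the high-noise regime.
        return 1.0
    if progress < stage2_end:
        # Stage 2: feedback control driven by the measured drift.
        return min(1.0, max(0.0, pid.update(drift)))
    # Stage 3: reference injection is switched off.
    return 0.0
```

In a sampling loop, one would call `cosine_drift` on features of the current latent and of the reference at each step, then pass the result to `injection_strength` to scale whatever reference conditioning the sampler applies.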
Supplementary Material: zip
Primary Area: generative models
Submission Number: 8568