Keywords: text-to-video generation; multi-character personalization
TL;DR: Our work explores multi-character text-to-video generation (e.g., mixing Tom and Jerry with Mr. Bean), preserving individual identities and personalized motions while enabling smooth, natural interactions.
Abstract: Imagine Mr. Bean stepping into Tom and Jerry---can we generate videos where characters interact naturally across different worlds? We study inter-character interaction in text-to-video generation, where the key challenge is to preserve each character's identity and behaviors while enabling coherent cross-context interaction. This is difficult because the characters may never have coexisted and because mixing styles often causes **style delusion**, where realistic characters appear cartoonish or vice versa. We introduce a framework that tackles these issues with Cross-Character Embedding (CCE), which learns identity and behavioral logic across multimodal sources, and Cross-Character Augmentation (CCA), which enriches training with synthetic co-existence and mixed-style data. Together, these techniques enable natural interactions between characters that have never coexisted, without losing stylistic fidelity. Experiments on a curated benchmark of cartoons and live-action series with 10 characters show clear improvements in identity preservation, interaction quality, and robustness to style delusion, enabling new forms of generative storytelling. Project page: https://mi-mi-x.github.io/.
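The CCA idea of enriching training with synthetic co-existence and mixed-style data can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline; the function name, sample fields, and prompt template are all hypothetical, since the abstract does not specify implementation details:

```python
import random

def cross_character_augment(samples, num_pairs, seed=0):
    """Hypothetical sketch of Cross-Character Augmentation (CCA):
    pair clips of characters from different style domains (e.g. cartoon
    vs. live-action) into synthetic co-existence training examples."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        a, b = rng.sample(samples, 2)
        # Only keep cross-style pairs, so the model sees mixed-style
        # co-existence data and is less prone to style delusion.
        if a["style"] != b["style"]:
            pairs.append({
                "characters": [a["character"], b["character"]],
                "styles": [a["style"], b["style"]],
                "prompt": f'{a["character"]} and {b["character"]} '
                          f'interact in one scene',
            })
    return pairs

# Example usage with toy metadata (characters named in the abstract):
samples = [
    {"character": "Tom", "style": "cartoon"},
    {"character": "Jerry", "style": "cartoon"},
    {"character": "Mr. Bean", "style": "live-action"},
]
augmented = cross_character_augment(samples, num_pairs=4)
```

Each synthetic pair combines characters that never coexisted in the source material, which is the co-existence signal CCA is described as providing.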
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3735