Abstract: Amid the surge in generic text-to-video generation, personalized human video generation has advanced notably, though primarily in single-person scenarios. To our knowledge, two-person interaction, particularly martial arts combat, remains unexplored. We identify a significant gap: existing models for single-person dance generation cannot capture the subtleties and complexities of two engaged fighters, leading to identity confusion, anomalous limbs, and action mismatches. To address this, we introduce a new task, Personalized Martial Arts Combat Video Generation, together with MagicFight, an approach crafted specifically to overcome these hurdles. Because no suitable dataset exists for this task, we build one with the Unity game engine, meticulously crafting a large collection of 3D characters, martial arts moves, and scenes that reflect the diversity of combat. MagicFight refines and adapts existing models and strategies to generate high-fidelity two-person combat videos that preserve individual identities and ensure seamless, coherent action sequences, laying the groundwork for future work on interactive video content creation.
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Our work, "MagicFight: Personalized Martial Arts Combat Video Generation," contributes to multimedia and multimodal processing by introducing a new task and a novel approach that integrates multiple modalities. Our model generates high-fidelity videos while jointly conditioning on image, video, skeleton, and text inputs.
Using the Unity game engine, we construct a bespoke dataset that captures martial arts combat through detailed 3D characters and environments. This dataset is the foundation on which our model learns the nuanced interactions between visual elements and the underlying dynamics of combat.
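The submission does not specify how the Unity-rendered data is organized; as a minimal sketch, one plausible layout pairs each clip's rendered frames with a JSON file of per-frame 2D joint positions for both fighters exported from Unity. All file names and the schema below are hypothetical.

```python
import json
from pathlib import Path

def load_combat_clip(clip_dir: str):
    """Load one rendered combat clip: frame paths plus per-frame 2D skeletons
    for both fighters. Layout and file names are hypothetical."""
    clip = Path(clip_dir)
    frames = sorted(clip.glob("frames/*.png"))
    # skeletons.json is assumed to hold, per frame, one joint list per fighter:
    # {"frames": [{"fighter_a": [[x, y], ...], "fighter_b": [[x, y], ...]}, ...]}
    with open(clip / "skeletons.json") as f:
        skeletons = json.load(f)["frames"]
    assert len(frames) == len(skeletons), "every frame needs a pose annotation"
    return list(zip(frames, skeletons))

# Usage: iterate a clip and pair each rendered frame with both fighters' poses.
for frame_path, poses in load_combat_clip("data/clip_0001"):
    fighter_a, fighter_b = poses["fighter_a"], poses["fighter_b"]
    # ...feed (frame, fighter_a, fighter_b) to the training pipeline
```

Keeping the two fighters' annotations in separate fields, rather than one merged joint list, makes it straightforward to supervise identity preservation downstream.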
Furthermore, our method, MagicFight, is designed to handle the complex interplay between visual modalities and textual descriptions. It translates textual scene prompts into video content, ensuring that the generated result is not only visually compelling but also semantically aligned with the intended prompt.
The innovation lies in our model's ability to maintain individual identities within the generated videos, which requires sophisticated understanding and manipulation of the image and video modalities. In addition, integrating skeleton data allows precise control over the characters' movements and actions, adding another layer of depth to the multimodal processing.
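The submission does not state how the two fighters' skeletons are encoded for control; one plausible illustration, sketched below with entirely hypothetical names, rasterizes each fighter's 2D joints into its own Gaussian heatmap channel, so the conditioning signal itself keeps the two identities and their actions apart.

```python
import numpy as np

def pose_to_map(joints, h=256, w=256, sigma=4.0):
    """Rasterize one fighter's 2D joints into a single Gaussian heatmap (H, W)."""
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.zeros((h, w), dtype=np.float32)
    for x, y in joints:
        m = np.maximum(m, np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)))
    return m

def combat_condition(pose_a, pose_b, h=256, w=256):
    """Stack the two fighters' pose maps as separate channels, one per identity."""
    return np.stack([pose_to_map(pose_a, h, w), pose_to_map(pose_b, h, w)])

# Usage: build a per-frame, two-channel conditioning map from both fighters' joints.
cond = combat_condition([(100, 120), (110, 160)], [(180, 125), (170, 165)])
print(cond.shape)  # (2, 256, 256)
```

Under this assumption, channel assignment is what disambiguates the fighters, which is one simple way to counter the identity confusion the abstract describes.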
Supplementary Material: zip
Submission Number: 618