Keywords: Video Generation, Video Customization
Abstract: Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often fail to maintain identity consistency and support only a limited range of input modalities. In this paper, we propose OmniCustom, a multi-modal customized video generation model that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, OmniCustom introduces an identity-enhanced text-image conditioning module based on LLaVA for improved multi-modal understanding, and an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable flexible audio- and video-driven customization, we further propose modality-specific injection modules. Our identity-disentangled AudioNet injects temporally aligned audio features into video latents via spatial cross-attention, enabling precise audio control. For video-driven generation, we design an identity-disentangled video injection module that projects the conditional video into the latent space and efficiently aligns its features with the video latents for seamless integration. Extensive experiments on single- and multi-subject scenarios show that OmniCustom significantly outperforms state-of-the-art methods in ID consistency, realism, and text-video alignment. We further demonstrate its robustness on downstream tasks such as audio- and video-driven customized video generation, highlighting the effectiveness of our multi-modal conditioning and identity-preserving strategies.
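To make the two injection mechanisms described in the abstract concrete, below is a minimal PyTorch sketch of (a) image ID enhancement via temporal concatenation and (b) audio injection via spatial cross-attention. All names (`AudioCrossAttention`, `concat_identity_latents`), tensor shapes, and the residual-update scheme are illustrative assumptions on our part, not the authors' released implementation.

```python
# Illustrative sketch only: module names, shapes, and hyper-parameters
# are assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn


class AudioCrossAttention(nn.Module):
    """Injects per-frame audio features into video latents via spatial
    cross-attention: queries are the spatial latent tokens of one frame,
    keys/values are the temporally aligned audio tokens for that frame."""

    def __init__(self, latent_dim: int, audio_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=n_heads,
            kdim=audio_dim, vdim=audio_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, S, D) -- T frames, S spatial tokens per frame
        # audio:   (B, T, A, Da) -- A audio tokens aligned to each frame
        B, T, S, D = latents.shape
        q = self.norm(latents).reshape(B * T, S, D)            # spatial queries
        kv = audio.reshape(B * T, audio.shape[2], audio.shape[3])
        out, _ = self.attn(q, kv, kv)                          # per-frame attention
        return latents + out.reshape(B, T, S, D)               # residual injection


def concat_identity_latents(video_latents: torch.Tensor,
                            id_latents: torch.Tensor) -> torch.Tensor:
    """Image ID enhancement via temporal concatenation: the encoded identity
    image is prepended along the frame axis, so temporal attention layers can
    propagate identity features to every generated frame."""
    # video_latents: (B, T, S, D); id_latents: (B, 1, S, D)
    return torch.cat([id_latents, video_latents], dim=1)       # (B, T+1, S, D)


# Usage with toy shapes (purely for illustration):
lat = torch.randn(2, 16, 256, 1024)   # 16 frames, 16x16 spatial tokens
aud = torch.randn(2, 16, 4, 768)      # 4 audio tokens per frame
lat = AudioCrossAttention(1024, 768)(lat, aud)
lat = concat_identity_latents(lat, torch.randn(2, 1, 256, 1024))
```

The sketch highlights why the abstract distinguishes the two routes: identity conditioning is injected once along the temporal axis and then diffused by existing temporal attention, whereas audio must be aligned frame-by-frame, which is what the per-frame spatial cross-attention provides.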
Primary Area: generative models
Submission Number: 11394