DreamActor-M2: Unleashing Pre-trained Video Models for Universal Character Image Animation via In-Context Fine-tuning

03 Sept 2025 (modified: 23 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: human animation generation, diffusion model, controllable video generation
Abstract: Character image animation aims to generate high-fidelity videos from a reference image and a driving video, with broad applications in digital humans. Despite recent advances, current methods suffer from two key limitations: reliance on auxiliary pose encoders introduces modality gaps that weaken alignment with pre-trained generative priors, and dependence on explicit pose signals severely limits generalization beyond human-centric scenarios. We propose DreamActor-M2, a universal framework that redefines motion conditioning through an in-context LoRA fine-tuning paradigm. By directly concatenating motion signals and reference images into a unified input, our approach preserves the backbone’s native modality and fully exploits its pre-trained capabilities without architectural modifications, enabling plug-and-play motion control consistent with the principles of in-context learning. Furthermore, we extend this formulation beyond pose-driven control to an end-to-end framework that conditions directly on raw video frames, trained with a synthesis-driven data generation pipeline. Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance with superior fidelity, controllability, and cross-domain generalization, marking a significant step toward more flexible and scalable motion-driven video generation.
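The abstract describes in-context conditioning: reference-image tokens and motion (driving-video) tokens are concatenated into a single sequence fed to a frozen pre-trained video backbone, with only lightweight LoRA adapters trained. The following is a minimal conceptual sketch of that idea, not the authors' implementation; the toy backbone block, dimensions, and all names (`LoRALinear`, `ToyBackboneBlock`, `in_context_inputs`) are assumptions for illustration.

```python
# Conceptual sketch of in-context conditioning with LoRA (assumed structure,
# not the paper's code): conditioning tokens and target tokens share one
# sequence, so the backbone attends across them without extra pose encoders.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # LoRA starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class ToyBackboneBlock(nn.Module):
    """Stand-in for one block of a pre-trained video diffusion transformer."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = LoRALinear(nn.Linear(dim, dim))  # LoRA-adapted projection

    def forward(self, tokens):
        h = self.norm(tokens)
        h, _ = self.attn(h, h, h)            # full attention over the joint sequence
        return tokens + self.proj(h)


def in_context_inputs(ref_tokens, motion_tokens, noisy_video_tokens):
    """Concatenate conditioning and target tokens into one unified input."""
    return torch.cat([ref_tokens, motion_tokens, noisy_video_tokens], dim=1)


if __name__ == "__main__":
    B, dim = 2, 256
    ref = torch.randn(B, 64, dim)            # tokens from the reference image
    motion = torch.randn(B, 256, dim)        # tokens from the driving video / motion signal
    noisy = torch.randn(B, 256, dim)         # noisy latent tokens being denoised
    block = ToyBackboneBlock(dim)
    out = block(in_context_inputs(ref, motion, noisy))
    print(out.shape)                         # torch.Size([2, 576, 256])
```

Under this reading, only the LoRA parameters would be optimized, which is what would make the motion control "plug-and-play" on top of an unmodified pre-trained backbone.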
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1410