Keywords: unified multimodal video modeling, video generation, video understanding
TL;DR: UniVid unifies video generation and understanding by using MLLM-produced, semantically rich textual tokens to steer video diffusion and a training-light, keyframe-centric Pyramid Reflection for temporal reasoning.
Abstract: Unified video modeling that combines generation and understanding is increasingly important, yet it faces two key challenges: (1) maintaining semantic faithfulness during flow-based generation, which suffers from text-visual token imbalance and the suboptimality of uniform cross-modal attention across the flow trajectory, and (2) efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of our unified video model: a 2.2% improvement in VBench-Long total score over the previous SOTA method EasyAnimateV5.1, and accuracy gains of 1.0% on MSVD-QA and 3.3% on ActivityNet-QA over the best prior 7B baselines.
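The abstract only names dynamic keyframe selection without detailing it; the following is a minimal, hypothetical sketch of what a coarse-to-fine ("pyramid") keyframe-selection loop could look like, assuming frames are probed sparsely, scored for relevance, and then sampling is refined around the best-scoring region. The `relevance` oracle and all parameter names are illustrative stand-ins, not the paper's actual Pyramid Reflection implementation.

```python
# Hypothetical coarse-to-fine keyframe selection (NOT the paper's implementation).
# A relevance oracle stands in for an MLLM-based frame-question score.
from typing import Callable, List


def pyramid_keyframes(
    num_frames: int,
    relevance: Callable[[int], float],  # hypothetical frame -> score oracle
    levels: int = 3,
    per_level: int = 4,
) -> List[int]:
    """Return a small set of keyframe indices via coarse-to-fine refinement."""
    lo, hi = 0, num_frames - 1
    selected: List[int] = []
    for _ in range(levels):
        # Uniformly probe the current window, keep the best-scoring frame,
        # then shrink the window around it for the next (finer) level.
        stride = max((hi - lo) // max(per_level - 1, 1), 1)
        candidates = list(range(lo, hi + 1, stride))[:per_level]
        best = max(candidates, key=relevance)
        selected.append(best)
        half = max((hi - lo) // 4, 1)
        lo, hi = max(best - half, 0), min(best + half, num_frames - 1)
    # Deduplicate while preserving temporal order.
    return sorted(set(selected))


if __name__ == "__main__":
    # Toy relevance peaked around frame 70 of a 100-frame clip.
    score = lambda i: -abs(i - 70)
    print(pyramid_keyframes(100, score))  # e.g. [66, 70, 74], clustered near the peak
```

Under these assumptions, the selected keyframes are then the only frames passed to the MLLM for temporal reasoning, which is what would keep the approach training-light.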
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4981