MoCa: Modeling Object Consistency for 3D Camera Control in Video Generation

Zhijing Cheng; Xuancheng Zhang; Donglin Di; Chen Wei; Hao Li; Xun Yang

Published: 26 Jan 2026, Last Modified: 16 May 2026, Venue: ICLR 2026 Poster, License: CC BY 4.0

Keywords: Text-To-Video, Camera-Control, Video Generation, Generative Model

TL;DR: We propose MoCa, a framework that enables precise camera control in text-to-video generation by modeling object view, appearance, and motion consistency to bridge 2D pixels and 3D scenes without explicit 3D supervision.

Abstract: Camera control is important in text-to-video generation for achieving realistic scene navigation and view synthesis. This control is defined by parameters that describe movement through 3D space, thereby introducing 3D consistency into the generation process. A core challenge for existing methods is achieving 3D consistency within the 2D pixel domain. Strategies that directly integrate camera conditions into text-to-video models often produce artifacts, while those relying on explicit 3D supervision face challenges with generalization. Both limitations originate from the gap between the 2D pixel space and the underlying 3D world. The key insight is that the projection of a smooth 3D camera movement produces consistency in object view, appearance, and motion across 2D frames. Inspired by this insight, we propose MoCa, a dual-branch framework that bridges this gap by modeling object consistency to implicitly learn 3D relationships between the camera and the scene. To ensure view consistency, we design a Spatial-Temporal Camera Encoder with Plücker embedding, which encodes camera trajectories into a geometrically grounded latent representation. For appearance consistency, we introduce a semantic guidance strategy that leverages persistent vision-language features to maintain object identity and texture across frames. To address motion consistency, we propose an object-aware motion disentanglement mechanism that separates object dynamics from global camera movement, ensuring precise camera control and natural object motion. Experiments show that MoCa achieves accurate camera control while preserving video quality, offering a practical and effective solution for camera-controllable video generation.
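The abstract names Plücker embedding as the input representation for camera trajectories but does not spell out how it is computed. As a rough illustration only, the sketch below shows the per-pixel Plücker ray formulation that is common in camera-control work: each pixel's ray is described by its moment `o × d` and unit direction `d`. The function names, the pinhole intrinsics `K`, the camera-to-world convention for `c2w`, and the half-pixel offset are assumptions made for this example, not details taken from MoCa.

```python
import numpy as np


def plucker_embedding(K, c2w, height, width):
    """Per-pixel Plücker coordinates (o x d, d) for a single camera.

    K:   (3, 3) pinhole intrinsics (assumed convention).
    c2w: (4, 4) camera-to-world extrinsics (assumed convention).
    Returns an array of shape (height, width, 6).
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    d = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # The camera center is the origin shared by every ray of this frame.
    o = np.broadcast_to(c2w[:3, 3], d.shape)
    m = np.cross(o, d)  # moment vector of each ray
    return np.concatenate([m, d], axis=-1)


def trajectory_embedding(K, c2w_list, height, width):
    """Stack per-frame embeddings into a (T, H, W, 6) array (hypothetical helper)."""
    return np.stack([plucker_embedding(K, c2w, height, width) for c2w in c2w_list])
```

Under these assumptions, a camera trajectory of `T` poses becomes a `(T, H, W, 6)` tensor that a camera encoder could consume alongside the video latents; how MoCa's Spatial-Temporal Camera Encoder actually processes it is not described in the abstract.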

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 10931