Keywords: Video generation; 3D scene exploration; Reward feedback learning
Abstract: Recent advancements in camera-controlled video diffusion models have significantly improved video-camera alignment and enabled more accurate 3D scene generation, driven by potential downstream applications such as virtual reality.
However, we reveal that existing approaches often struggle to precisely adhere to the given camera conditions, leading to inconsistencies in the 3D geometry.
Inspired by Reward Feedback Learning (ReFL) in diffusion models, which has demonstrated strong potential in aligning model outputs with task-specific objectives, we build upon this paradigm to further improve camera controllability.
Directly applying existing ReFL approaches, however, faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding.
To address these limitations, we introduce a camera-aware 3D decoder that efficiently decodes video latents into 3D representations for reward computation. Specifically, we project the video latent and the camera pose into 3D Gaussians, which support efficient rendering from arbitrary views.
In this process, the camera pose not only acts as an input variable but also serves as a projection parameter for determining the mean of each Gaussian.
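To make this projection step concrete, the following is a minimal sketch (not the authors' implementation) of how a camera pose can serve as a projection parameter for Gaussian means: per-pixel depths, assumed here to be predicted from the video latent, are unprojected through the camera intrinsics and pose into 3D centers. The function name and the per-pixel depth assumption are hypothetical.

```python
import torch

def unproject_to_gaussian_means(depth, K, cam_to_world):
    """Lift a depth map (H, W) to 3D Gaussian centers (H*W, 3).

    depth:        (H, W) per-pixel depth, e.g. decoded from the video latent
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsic (the camera condition)
    """
    H, W = depth.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    # Homogeneous pixel coordinates (u, v, 1), later scaled by depth.
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)       # (H, W, 3)
    cam_pts = (torch.linalg.inv(K) @ pix.reshape(-1, 3).T).T    # (H*W, 3) rays
    cam_pts = cam_pts * depth.reshape(-1, 1)                    # scale rays by depth
    # The camera pose determines where each Gaussian mean lands in world space.
    cam_pts_h = torch.cat([cam_pts, torch.ones_like(cam_pts[:, :1])], dim=-1)
    world_pts = (cam_to_world @ cam_pts_h.T).T[:, :3]           # (H*W, 3)
    return world_pts  # one Gaussian mean per pixel
```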
If the generated video does not match the camera conditions, the 3D structure becomes geometrically inconsistent, leading to blurry rendered images.
Based on this property, we explicitly optimize pixel-level consistency between rendered novel views and ground-truth ones as reward feedback.
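As an illustration of this reward, here is a hedged sketch assuming a differentiable Gaussian rasterizer is available; `render_gaussians` is a hypothetical name passed in as a callable, not a real library API.

```python
import torch
import torch.nn.functional as F

def camera_alignment_reward(gaussians, novel_pose, novel_K, gt_frame, render_gaussians):
    """Negative pixel error between a rendered novel view and ground truth.

    gaussians:        3D Gaussian parameters decoded from the video latent
    novel_pose:       (4, 4) pose of a held-out view
    novel_K:          (3, 3) intrinsics of that view
    gt_frame:         (3, H, W) ground-truth image at that view
    render_gaussians: differentiable rasterizer returning a (3, H, W) image
    """
    rendered = render_gaussians(gaussians, novel_pose, novel_K)
    # If the generated video violates the camera condition, the Gaussians are
    # geometrically inconsistent, the rendering blurs, and the error grows.
    loss = F.l1_loss(rendered, gt_frame)
    return -loss  # higher reward = better video-camera alignment
```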
To accommodate the stochastic nature of video generation, we further introduce a visibility term that selectively supervises only the deterministic regions derived via geometric warping.
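A simplified sketch of such a visibility mask via geometric warping follows: source pixels are lifted with their depth and reprojected into the target view, and only pixels that land inside the target image bounds are treated as deterministic. Occlusion handling is omitted for brevity, and the function name is illustrative.

```python
import torch

def visibility_mask(depth_src, K, src_to_tgt, H, W):
    """Mark source pixels that project inside the target view.

    depth_src:  (H, W) depth of the source (conditioning) view
    K:          (3, 3) intrinsics shared by both views
    src_to_tgt: (4, 4) relative transform from source to target camera
    """
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth_src.dtype),
        torch.arange(W, dtype=depth_src.dtype),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)
    pts = (torch.linalg.inv(K) @ pix.T).T * depth_src.reshape(-1, 1)
    pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=-1)
    pts_tgt = (src_to_tgt @ pts_h.T).T[:, :3]        # points in target camera frame
    proj = (K @ pts_tgt.T).T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # reprojected pixel coords
    # Deterministic pixels project in front of the target camera and inside
    # its image bounds; all other regions are left unsupervised.
    mask = (
        (pts_tgt[:, 2] > 0)
        & (uv[:, 0] >= 0) & (uv[:, 0] < W)
        & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    )
    return mask.reshape(H, W)
```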
Extensive experiments conducted on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method in enhancing both camera controllability and generation quality.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 2989