Epipolar Geometry Improves Video Generation Models

Orest Kupyn; Fabian Manhardt; Federico Tombari; Christian Rupprecht

09 May 2025 (modified: 29 Oct 2025). Submitted to NeurIPS 2025. Readers: Everyone. License: CC BY 4.0

Keywords: diffusion models, epipolar geometry, latent video diffusion, video generation, 3d geometry, 3d computer vision, direct preference optimization

TL;DR: Improving large video generation models with classical computer vision algorithms through a Flow-DPO objective

Abstract: Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet, despite these advances, these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. This work explores how simple epipolar geometry constraints can improve modern video diffusion models trained on internet-scale datasets. Despite their massive training data, these models often fail to capture the fundamental geometric principles underlying all visual content. While traditional computer vision methods are often non-differentiable and computationally expensive, they provide reliable, mathematically grounded signals for 3D consistency evaluation. We demonstrate that aligning diffusion models through a preference-based optimization framework using pairwise epipolar geometry constraints yields videos with superior visual quality, enhanced 3D consistency, and significantly improved motion stability. Our approach offers an efficient alignment strategy that enforces established geometric principles without requiring end-to-end differentiability. Evaluation shows that our method outperforms baseline models and alternative alignment approaches across various metrics. By bridging the gap between data-driven deep learning and classical geometric computer vision, we present a practical method for generating more spatially consistent videos without compromising visual quality or requiring explicit 3D supervision.
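As a rough illustration of how a non-differentiable epipolar signal could be turned into preference pairs, the sketch below scores frame-to-frame correspondences with the Sampson epipolar error and picks the best and worst candidate videos. The abstract does not say which epipolar error is used or how pairs are formed, so the choice of Sampson error, the function names `sampson_error` and `preference_pair`, and the assumption that the fundamental matrix `F` and the point matches are estimated elsewhere are all hypothetical.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Mean Sampson epipolar error for matches x1 <-> x2 (N x 2 pixel coords).

    F is a 3x3 fundamental matrix, assumed to be estimated elsewhere
    (for example with a robust estimator on matched keypoints).
    """
    n = x1.shape[0]
    h1 = np.hstack([x1, np.ones((n, 1))])  # homogeneous points, frame 1
    h2 = np.hstack([x2, np.ones((n, 1))])  # homogeneous points, frame 2
    Fx1 = h1 @ F.T    # row i is F @ x1_i (epipolar line in frame 2)
    Ftx2 = h2 @ F     # row i is F^T @ x2_i (epipolar line in frame 1)
    num = np.sum(h2 * Fx1, axis=1) ** 2  # (x2^T F x1)^2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return float(np.mean(num / (den + 1e-12)))

def preference_pair(errors):
    """Return (preferred, rejected) indices: lowest and highest error.

    Hypothetical pairing rule; the scores are only ranked, never
    differentiated, which is why a preference objective can consume them.
    """
    order = np.argsort(errors)
    return int(order[0]), int(order[-1])
```

A preference-based objective would then treat the lower-error video as the preferred sample and the higher-error one as the rejected sample for the same prompt.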

Supplementary Material: zip

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 13165
