Epipolar Geometry Improves Video Generation Models

Anonymous Authors
Anonymous Submission
Improved 3D Consistency

Abstract

Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Despite these advances, they still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes.

3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. This work explores how simple epipolar geometry constraints can improve modern video diffusion models trained on internet-scale datasets. Despite their massive training data, these models often fail to capture the fundamental geometric principles underlying all visual content.

While traditional computer vision methods are often non-differentiable and computationally expensive, they provide reliable, mathematically grounded signals for 3D consistency evaluation. We demonstrate that aligning diffusion models through a preference-based optimization framework using pairwise epipolar geometry constraints yields videos with superior visual quality, enhanced 3D consistency, and significantly improved motion stability.

Method

Our approach bridges modern video diffusion models with classical computer vision algorithms using epipolar geometry constraints as reward signals in a preference-based finetuning framework.
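As a rough illustration of how a non-differentiable geometric score can act as a preference signal, the sketch below ranks videos by epipolar error and emits (preferred, rejected) pairs. The function name, the assumption that candidates share a prompt, and the `min_gap` margin are hypothetical and not taken from the paper.

```python
from itertools import combinations


def build_preference_pairs(scored_videos, min_gap=0.0):
    """Turn per-video epipolar errors into (preferred, rejected) id pairs.

    scored_videos: list of (video_id, epipolar_error) tuples, assumed here to
        come from the same prompt. Lower error = more geometrically consistent.
    min_gap: hypothetical margin; pairs whose errors differ by this much or
        less are skipped because their ranking would be unreliable.
    """
    pairs = []
    for (id_a, err_a), (id_b, err_b) in combinations(scored_videos, 2):
        if abs(err_a - err_b) <= min_gap:
            continue  # too close to call
        if err_a < err_b:
            pairs.append((id_a, id_b))
        else:
            pairs.append((id_b, id_a))
    return pairs


# Illustrative scores only, not measured values.
samples = [("v0", 0.19), ("v1", 0.13), ("v2", 0.14)]
pairs = build_preference_pairs(samples, min_gap=0.02)
print(pairs)
```

Only the ordering of the scores is used, which is why the reward itself never needs to be differentiable.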

Method Diagram
1. Generate Paired Videos

Generate diverse videos using pretrained generators and score their epipolar consistency with the Sampson distance, separating examples that satisfy the epipolar constraints well from those that violate them.

2. Preference-Based Optimization

Train the policy with Flow-DPO to prefer geometrically consistent outputs by learning from paired examples ranked by epipolar error, without requiring differentiable rewards.

3. Enhanced Generation

Apply the updated policy to enhance 3D consistency in the base video diffusion model, leading to more stable camera trajectories and reduced artifacts.
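The preference objective in step 2 can be sketched as a DPO-style loss on flow-matching errors. This is a minimal sketch that assumes the Diffusion-DPO form adapted to velocity prediction with a constant `beta`. The exact weighting and hyperparameters used by the authors are not stated on this page, and all names below are hypothetical.

```python
import numpy as np


def flow_dpo_loss(err_win_policy, err_win_ref, err_lose_policy, err_lose_ref, beta=1.0):
    """DPO-style preference loss on flow-matching errors (illustrative only).

    Each argument holds per-pair squared velocity-prediction errors, e.g.
    ||v_target - v_model(x_t, t)||^2, for the preferred ("win") and rejected
    ("lose") video under the trainable policy and a frozen reference model.
    beta is an assumed constant temperature.
    """
    win_p, win_r, lose_p, lose_r = (
        np.asarray(a, dtype=float)
        for a in (err_win_policy, err_win_ref, err_lose_policy, err_lose_ref)
    )
    # Positive margin: relative to the reference, the policy fits the preferred
    # video better than it fits the rejected one.
    margin = (lose_p - lose_r) - (win_p - win_r)
    # -log(sigmoid(beta * margin)), written stably as log(1 + exp(-beta * margin)).
    return float(np.mean(np.logaddexp(0.0, -beta * margin)))


# Policy identical to the reference: no preference has been learned yet.
neutral = flow_dpo_loss([1.0], [1.0], [1.0], [1.0])
# Policy fits the preferred video better and the rejected video worse.
better = flow_dpo_loss([0.5], [1.0], [1.5], [1.0])
print(neutral, better)
```

The loss depends only on which video of a pair was ranked higher, so the epipolar score enters training through the pairing and never through a gradient.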

Results

Our epipolar-aligned model significantly reduces artifacts and enhances motion smoothness, resulting in more geometrically consistent 3D scenes. To demonstrate our model, we present examples where the generated video from the original model lacks 3D consistency. Below you can toggle between baseline model results and our approach for direct comparison.

Basketball Court
Chinese Temple
Notice how our model preserves consistent 3D structure throughout camera movements, maintaining the shape and scale of the temple's parts.
House Front Door
The epipolar-aligned model generates stable perspective with fewer artifacts during camera movement.
Residential Complex Exterior
Our epipolar-aligned model maintains structural integrity of the apartment buildings with consistent perspective as the camera moves. Notice how buildings remain geometrically stable, eliminating the warping and distortion seen in the baseline.
Living Room Interior
In the baseline model, the sofa warps unnaturally as the camera moves through the scene, creating a distracting perspective distortion. Our approach maintains proper object proportions and shape consistency throughout the camera trajectory, preserving the realistic appearance of furniture and interior elements.
Outdoor Playground Climbing Wall
The baseline model exhibits noticeable distortions, particularly in finer details like climbing holds and netting as the camera moves. Our epipolar alignment preserves the geometric consistency of these small elements and maintains proper perspective relationships between the climbing wall and background trees, creating a more realistic outdoor scene.
Elegant Staircase with Red Carpet
The baseline model shows noticeable distortion in the staircase geometry as the camera moves. Our approach maintains proper perspective, preserving straight lines and consistent architectural features throughout the camera trajectory.
Bustling Historical Marketplace
Even in this highly dynamic scene with numerous people, our method significantly reduces the motion artifacts and temporal inconsistencies seen in the baseline. Building facades maintain proper perspective while moving subjects appear more naturally integrated with the environment, demonstrating our approach's effectiveness beyond just static scenes.
Classic Car Show with Motion
The baseline model struggles to maintain scene consistency during camera movement, creating many warping artifacts. Our approach preserves vehicle geometry and spatial relationships between moving cars, demonstrating how epipolar constraints improve even scenes with moving objects.
Mountain Biking Trail Perspective
The baseline model struggles with the combined motion of the cyclist and camera, creating unnatural texture warping and geometric distortions in the rocky terrain. Our approach maintains consistent rock textures and landscape features while preserving the sense of motion, demonstrating effectiveness in first-person action sequences.
Car Driving Through The Dock
The baseline struggles to model the car's collision with the objects and to maintain a smooth trajectory, causing unnatural warping and perspective distortion of the objects.

Quantitative Evaluation

We evaluate our approach using VBench metrics and direct 3D geometry evaluations. For the reconstruction metrics (PSNR, SSIM, LPIPS), we reconstruct each scene with Gaussian splatting and measure the quality of the reconstruction.
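For reference, PSNR follows the standard definition sketched below. How frames are split into fitted and held-out views for the Gaussian splatting evaluation is not specified on this page, so the inputs here are placeholders and the function name is hypothetical.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape.

    max_value is the largest possible pixel value (1.0 for float images,
    255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Placeholder images: a uniform offset of 0.1 gives an MSE of 0.01.
frame = np.zeros((4, 4, 3))
value = psnr(frame, frame + 0.1)
print(value)
```

Higher PSNR means the rendered view is closer to the reference frame, which is why a more 3D-consistent video should be easier to reconstruct.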

3D Consistency & Reconstruction
Method   | Sampson Error ↓ | Perspective ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Human Eval
Baseline | 0.190           | 0.426         | 22.32  | 0.706  | 0.343   | 54.1%
Ours     | 0.131           | 0.428         | 23.13  | 0.729  | 0.315   | 71.8%
VBench Metrics
Method   | Background Consistency | Aesthetic Quality | Temporal Flickering | Motion Smoothness
Baseline | 0.930                  | 0.541             | 0.958               | 0.981
Ours     | 0.942                  | 0.551             | 0.969               | 0.984

Comparison with Direct Fine-Tuning

We compare our epipolar-aware alignment approach to standard supervised fine-tuning (SFT) on multi-view static datasets. While directly fine-tuning on perfectly 3D-consistent static scenes might seem like a viable alternative, it leads to overfitting to dataset-specific characteristics such as narrow camera trajectory distributions and scene types, ultimately hurting generalization to diverse dynamic content.

LA Street Scene
The model finetuned directly on multi-view data overfits to specific camera trajectories. When combined with significant dynamic content, it produces additional artifacts and fails to preserve object geometry.
Toy Street Scene
The SFT model degrades the baseline performance in some cases. Notice that the camera trajectory resembles the average camera trajectory in the training data, which leads to additional artifacts at the end of the video.

Limitations and Failure Cases

We present failure cases to provide transparency about the limitations of our approach and areas for future improvement.

Data Mining Pipeline

Our reward model relies on feature matching and epipolar geometry scoring. However, this pipeline can produce false positives (assigning low error to geometrically inconsistent videos) and false negatives (high error to consistent ones) when scenes have repetitive textures, lack distinctive features, or contain significant motion blur.
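The epipolar scoring step can be sketched with the standard Sampson distance. The sketch assumes matched keypoints and a fundamental matrix `F` are already available, for example from feature matching followed by a robust estimator. The specific matcher and estimator used by the authors are not given here, and the toy `F` below is an assumption chosen for easy checking.

```python
import numpy as np


def sampson_distance(F, pts1, pts2):
    """First-order geometric error of matches with respect to F.

    F: (3, 3) fundamental matrix with x2^T F x1 = 0 for exact matches.
    pts1, pts2: (N, 2) pixel coordinates of matched points in two frames.
    Returns an (N,) array of squared Sampson distances.
    """
    F = np.asarray(F, dtype=float)
    pts1 = np.asarray(pts1, dtype=float)
    pts2 = np.asarray(pts2, dtype=float)
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])  # homogeneous coordinates, shape (N, 3)
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T   # row i is F @ x1_i (epipolar line in frame 2)
    Ftx2 = x2 @ F    # row i is F^T @ x2_i (epipolar line in frame 1)
    numerator = np.sum(x2 * Fx1, axis=1) ** 2  # (x2^T F x1)^2
    denominator = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return numerator / denominator


# Toy F for a pure sideways camera translation: matches must share a row (v1 == v2).
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
pts1 = np.array([[10.0, 5.0], [20.0, 7.0]])
pts2 = np.array([[14.0, 5.0], [25.0, 9.0]])  # second match is 2 px off its epipolar line
errors = sampson_distance(F, pts1, pts2)
print(errors)
```

Because the score depends entirely on the matches and on `F`, scenes with repetitive textures, few distinctive features, or motion blur can corrupt it before any geometry is evaluated.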

False Positive: Low Error
This video received a low average error despite significant geometric artifacts in its first part. The near-static second part of the video pulls the average down, giving a low overall error.
False Negative: High Error
This geometrically consistent video received a high error because repetitive textures disrupted feature tracking.
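The false-positive case can be reproduced with a small numerical example: averaging per-frame-pair errors over a whole video lets a long near-static segment mask a short burst of artifacts. The error values and the windowed statistic below are hypothetical and serve only to illustrate the effect. They are not part of the authors' pipeline.

```python
import numpy as np


def summarize_errors(errors, window=6):
    """Return (mean error, worst mean error over any contiguous window)."""
    errors = np.asarray(errors, dtype=float)
    kernel = np.ones(window) / window
    # Sliding-window means; "valid" keeps only fully covered windows.
    windowed = np.convolve(errors, kernel, mode="valid")
    return float(errors.mean()), float(windowed.max())


# Hypothetical per-frame-pair errors: 6 pairs with artifacts, then 18 near-static pairs.
per_pair_errors = [0.30] * 6 + [0.01] * 18
mean_error, worst_window = summarize_errors(per_pair_errors)
print(mean_error, worst_window)
```

In this toy case the video-level mean sits far below the error of the affected segment, which mirrors how a near-static second half can yield a low overall score.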

Generation Failures

While our method significantly improves geometric consistency, it does not solve all failure modes. Complex dynamic scenes with extreme camera motion or highly ambiguous content remain challenging for both the baseline and our approach.

Near-Static Outdoor Scenes
The aligned model occasionally generates static frames for near-static drone videos of outdoor scenes.
Challenging Dynamic Scenes
Extreme illumination, extreme dynamics, and out-of-distribution content remain challenging for both the baseline and our approach.
Complex Interactions
Complex interactions between objects are not addressed by epipolar alignment.

Conclusion

We presented a novel approach for enhancing 3D consistency in video diffusion models by leveraging classical epipolar geometry constraints as preference signals. Our work demonstrates that aligning modern generative models with fundamental geometric principles can significantly improve the spatial coherence of generated content without requiring complex 3D supervision.

The resulting models generate videos with notably fewer geometric inconsistencies and more stable camera trajectories while preserving creative flexibility. This work highlights how classical computer vision algorithms can effectively complement deep learning approaches, addressing limitations in purely data-driven systems.