Keywords: Dense Point Map Estimation, Video Depth Estimation, Camera Pose and Intrinsic Estimation, Efficient High Resolution
TL;DR: Building on VGGT, we significantly improve visual geometry estimation by systematically analyzing the effectiveness of training data, training objectives, and efficient high-resolution input.
Abstract: Despite significant advances in feed-forward visual geometry estimation, several core components that underpin model performance remain underexplored. In this work, we systematically dissect these components and use the findings to improve VGGT. Our ablation experiments reveal that: 1) data diversity remains more important for accuracy than data quality; and 2) the widely adopted confidence-aware loss and spatial gradient loss can unexpectedly degrade performance. We further evaluate several existing techniques and show that sequence-level and frame-level alignment improve overall performance, whereas local region alignment unexpectedly reduces it. In addition, we propose two enhancements: a consistency loss that enforces coherence among depth maps, camera parameters, and point maps; and an efficient architectural adaptation that enables high-resolution visual geometry estimation. These insights and improvements are integrated into CARVE, a model that jointly predicts accurate and consistent geometry from arbitrary input views at high resolution. Extensive experiments on point cloud estimation, video depth estimation, and camera pose and intrinsic estimation across diverse benchmarks demonstrate that CARVE achieves state-of-the-art performance in visual geometry estimation.
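Illustrative note: the abstract does not state the form of the consistency loss, so the following is only a minimal sketch of one way such a coherence term could be written, not the authors' formulation. It assumes a pinhole intrinsic matrix `K`, a 4x4 camera-to-world pose, and a mean L1 penalty; the function names `unproject_depth` and `consistency_loss` are hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to world-space points of shape (H, W, 3).

    Assumes a pinhole camera with intrinsics K (3x3) and a 4x4
    camera-to-world transform. These conventions are assumptions.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixels
    rays = pixels @ np.linalg.inv(K).T                    # camera rays, z = 1
    cam_points = rays * depth[..., None]                  # scale rays by depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_points @ R.T + t                           # move to world frame

def consistency_loss(depth, K, cam_to_world, point_map):
    """Mean L1 gap between the predicted point map and the point map
    implied by the predicted depth and camera (hypothetical form)."""
    implied = unproject_depth(depth, K, cam_to_world)
    return float(np.mean(np.abs(implied - point_map)))
```

Under these assumptions the term is zero exactly when the point map agrees with the depth map unprojected through the predicted camera, and it grows as the three outputs drift apart.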
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6776