Keywords: generative and discriminative pretraining, monocular geometry estimation, benchmarks
TL;DR: Benchmarking and Analyzing Monocular Generative and Discriminative Geometry Estimation Models
Abstract: Recent advances in discriminative and generative pretraining have yielded geometry estimation foundation models with strong generalization capabilities. While most discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative paradigms show the potential to achieve impressive generalization on unseen scenes by leveraging pre-trained diffusion models and fine-tuning on even a small amount of synthetic training data. Frustratingly, these models are trained with different recipes on different datasets, making it hard to identify the critical factors that determine evaluation performance. To resolve this issue, (1) we build fair and strong baselines in a unified codebase for evaluating and analyzing state-of-the-art (SOTA) geometry estimation models from the perspectives of pretraining style, fine-tuning data, and model architecture; (2) we thoroughly evaluate geometry models on challenging benchmarks with diverse scenes and high-quality annotations. Under this fair training and evaluation configuration, our results reveal that the stochastic diffusion-based protocol is not optimal for fine-tuning generative geometry estimation methods: a one-step fine-tuning and inference protocol is sufficient for generative depth and surface normal estimation. Moreover, we find that both discriminative and generative pretraining generalize well on the scale-invariant depth estimation task when fine-tuned on small-scale, high-quality data; DINOv2-pretrained discriminative models achieve slightly higher performance than their generative counterparts with the same small amount of synthetic data. Furthermore, we observe that metric depth estimation requires significantly more fine-tuning data than scale-invariant depth estimation in order to learn the depth scale distribution.
We hope this work will inspire future geometry estimation research on building more high-quality fine-tuning datasets and designing more powerful geometry estimation models.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3351