Keywords: computer vision, correspondence, pointmap prediction
TL;DR: We present a dataset of floor plan-photo pairs from the Internet with pixel correspondences and camera poses, adapt DUSt3R to improve correspondence prediction, and identify systematic errors for future work.
Abstract: Geometric models like DUSt3R have driven great advances in understanding the geometry of a scene from pairs of photos. However, they fail when the inputs come from vastly different viewpoints (e.g., aerial vs.\ ground) or modalities (e.g., photos vs.\ abstract drawings) compared to what was observed during training. This paper addresses a challenging version of this problem: predicting correspondences between ground-level photos and floor plans. Current datasets for joint photo--floor plan reasoning are limited, either lacking varied modalities (VIGOR) or lacking correspondences (WAFFLE). To address these limitations, we introduce a new dataset, C3, created by first reconstructing a number of scenes in 3D from Internet photo collections via structure-from-motion, then manually registering the reconstructions to floor plans gathered from the Internet, from which we can derive correspondences between images and floor plans. C3 contains 90K paired floor plans and photos across 597 scenes, with 153M pixel-level correspondences and 85K camera poses. We find that state-of-the-art correspondence models struggle on this task. By training on our new data, we improve on the best-performing method by 34\% in RMSE. We also identify open challenges in cross-modal geometric reasoning that our dataset aims to help address. Our project website is available at: \url{https://c3po-correspondence.github.io/}.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/kwhuang/C3
Code URL: https://github.com/c3po-correspondence/C3Po
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in computer vision
Submission Number: 2406