Cameras as Rays: Pose Estimation via Ray Diffusion

Jason Y. Zhang; Amy Lin; Moneish Kumar; Tzu-Hsuan Yang; Deva Ramanan; Shubham Tulsiani

Cameras as Rays: Pose Estimation via Ray Diffusion

Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani

Published: 16 Jan 2024, Last Modified: 21 Apr 2024ICLR 2024 oralEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: 3D Computer Vision, Pose Estimation, Diffusion

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: Over-parameterize camera as a bundle of rays, which is a representation that can be predicted using a denoising diffusion model.

Abstract: Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: zip

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: representation learning for computer vision, audio, language, and other modalities

Submission Number: 2052

Loading