FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow

The dependence on SfM-computed camera poses prohibits 3D representation learning at scale. In this work, we train 3D scene representations without camera poses. Our method works robustly, succeeding even on the challenging CO3D dataset, where classical SfM methods struggle. Our key innovation is a camera pose formulation that leverages the robustness of optical flow methods: we lift optical flow into scene flow via differentiable rendering and differentiably solve for camera pose via a weighted Procrustes formulation. Our method is supervised only by optical flow and re-rendering losses.

TL;DR: We propose to train generalizable 3D scene representations without known camera poses.
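To make the weighted Procrustes step concrete, below is a minimal PyTorch sketch of a weighted Procrustes (Kabsch) solve between two corresponding 3D point sets, as one would obtain by lifting flow correspondences to surface points. This is an illustrative sketch, not the paper's exact implementation; the function name and signature are our own.

```python
import torch

def weighted_procrustes(X, Y, w):
    """Least-squares rigid alignment: argmin over (R, t) of
    sum_i w_i * ||R @ X[i] + t - Y[i]||^2.

    X, Y: (N, 3) corresponding 3D points (e.g., pixel-aligned surface
    points from two frames, matched by optical flow); w: (N,) weights.
    The solve uses torch.linalg.svd and is differentiable, so gradients
    can flow back into the points and the per-correspondence weights.
    """
    w = w / w.sum()                          # normalize weights
    mu_x = (w[:, None] * X).sum(0)           # weighted centroids
    mu_y = (w[:, None] * Y).sum(0)
    Xc, Yc = X - mu_x, Y - mu_y              # center both point sets
    H = Xc.T @ (w[:, None] * Yc)             # 3x3 weighted cross-covariance
    U, _, Vh = torch.linalg.svd(H)
    V = Vh.T
    d = torch.det(V @ U.T)                   # reflection fix: enforce det(R) = +1
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = V @ D @ U.T
    t = mu_y - R @ mu_x
    return R, t
```

Because the solve is a closed-form function of the inputs, re-rendering and flow losses on the resulting pose can supervise both the lifted scene flow and the confidence weights end to end.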


Below, we show generalizable pose estimation followed by generalizable view synthesis along a smoothed and wobbled trajectory. Since our model estimates poses and geometry on short video clips, we apply both pose estimation and view synthesis to sliding windows of the video and trajectory; a sketch of this window chaining follows below. Our model predicts poses at ~20 fps.
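As a rough illustration of the sliding-window chaining, here is a hypothetical sketch in PyTorch. `estimate_window_poses` stands in for the model's per-clip pose head and is assumed to return camera-to-world poses relative to each clip's first frame; the function names, window size, and stride are illustrative assumptions, not the released API.

```python
import torch

def chain_window_poses(frames, estimate_window_poses, window=8, stride=4):
    """Chain clip-relative poses from overlapping windows into one trajectory.

    frames: a sequence of video frames.
    estimate_window_poses: assumed to map a clip to (K, 4, 4) poses
    relative to the clip's first frame (so rel[0] is the identity).
    Overlapping windows are stitched by composing each window's poses
    with the already-known world pose of its first frame.
    """
    poses = [torch.eye(4)]                       # world pose of frame 0
    for start in range(0, len(frames) - 1, stride):
        clip = frames[start : start + window]
        rel = estimate_window_poses(clip)        # (len(clip), 4, 4)
        base = poses[start]                      # world pose of clip's first frame
        for k in range(1, len(clip)):
            if start + k == len(poses):          # append only unseen frames
                poses.append(base @ rel[k])
    return torch.stack(poses)
```

Note that this simple chaining accumulates drift over long videos, which is consistent with the lack of a loop-closure mechanism noted in the limitations.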

CO3D Hydrants


CO3D 10-Category


KITTI


RealEstate10K


Limitations


Our method does not model scene dynamics, does not robustly estimate camera intrinsics, has no loop-closure mechanism, and operates only on relatively short clips.
