Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Simultaneous Localization and Mapping (SLAM), Neural Radiance Field (NeRF)
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: real-time neural implicit mapping and depth estimation
Abstract: Reconstructing high-quality dense maps in real time is critical for building a 3D representation of the environment for robot sensing and navigation. Recently, Neural Radiance Fields (NeRF) have attracted great attention for their excellent capacity to represent 3D scenes; recent works therefore leverage NeRF to learn 3D maps, typically from RGB-D cameras. However, depth sensors are not available on all devices, whereas RGB cameras are cheap and widely deployed. We therefore propose to reconstruct scenes with NeRF from single RGB input, which is highly challenging without the geometric guidance of depth sensors. Moreover, we cultivate real-time capability through a lightweight implementation. In this paper, we propose FMapping, a factorized NeRF mapping framework that enables high-quality, real-time reconstruction from RGB input alone. The key insight of our method is that depth varies little across consecutive RGB frames, so geometric cues can be derived effectively from RGB given well-estimated depth priors. Specifically, we divide the mapping into 1) an initialization stage and 2) an on-the-fly stage. First, since trackers are not always stable during initialization, we start from noisy pose inputs to optimize the map. To this end, we exploit the geometric consistency between volume rendering and the signed distance function in a self-supervised way to capture depth accurately. In the second stage, given the short optimization budget imposed by real-time performance, we model depth estimation as a Gaussian process (GP) with a pre-trained, cost-effective depth covariance function to promptly infer depth conditioned on previous frames. Meanwhile, the per-pixel depth estimates and their uncertainties guide the NeRF sampling process: we densely allocate sample points within adjustable truncation regions near the surface and distribute additional samples to pixels with high uncertainty. In this way, we can continue building the map from subsequent poses once the tracker has stabilized. Experiments demonstrate that our framework outperforms state-of-the-art RGB-based mapping and achieves performance comparable to RGB-D mapping in photometric and geometric accuracy, with real-time depth estimation at around 5 Hz and consistent scale.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5930