Keywords: Multimodal, Pose-free, Novel View Synthesis
Abstract: Pose-free Neural Radiance Fields (NeRF) aim at novel view synthesis (NVS) without relying on accurate poses, offering significant practical value. Images and LiDAR point clouds are two pivotal modalities in autonomous driving scenarios. While demonstrating impressive performance, single-modality pose-free NeRFs often suffer from local optima, owing to the limited geometric information in dense image textures or the sparse, textureless nature of point clouds. Although prior methods have explored the complementary strengths of both modalities, they leverage inherently sparse point clouds only for discrete, non-pixel-wise depth supervision and are limited to NVS of images. As a result, a multimodal unified pose-free framework remains notably absent. In light of this, we propose MUP, a pose-free framework for joint LiDAR-Camera NVS in large-scale scenes. This unified framework enables continuous depth supervision for image reconstruction via LiDAR-Fields rather than discrete point clouds. With multimodal inputs, pose optimization receives gradients from the rendering losses of both point cloud geometry and image texture, thereby alleviating the local optima commonly encountered in single-modality pose-free tasks. Moreover, to further guide NeRF pose optimization, we propose a multimodal geometric optimizer that leverages geometric relations from point clouds and photometric regularization from adjacent image frames. In addition, to alleviate the domain gap between modalities, we propose a multimodal-specific coarse-to-fine training approach for unified, compact reconstruction. Extensive experiments on the KITTI-360 and NuScenes datasets demonstrate the superiority of MUP in achieving geometry-aware, modality-consistent, pose-free 3D reconstruction.
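As a rough illustration of the multimodal pose-optimization idea described in the abstract (not the paper's actual formulation; all symbols and weights here are assumptions), the per-frame pose T_i could be updated by gradients flowing from both a photometric rendering loss and a point-cloud geometry rendering loss:

\[
\mathcal{L}(\Theta, \{T_i\}) \;=\; \lambda_{\mathrm{img}}\, \mathcal{L}_{\mathrm{photo}}\!\big(\hat{I}(\Theta, T_i),\, I_i\big) \;+\; \lambda_{\mathrm{pc}}\, \mathcal{L}_{\mathrm{geo}}\!\big(\hat{D}(\Theta, T_i),\, P_i\big), \qquad T_i \leftarrow T_i - \eta\, \nabla_{T_i} \mathcal{L},
\]

where \(\Theta\) denotes the radiance-field parameters, \(\hat{I}\) and \(\hat{D}\) are rendered images and depths, \(I_i\) and \(P_i\) are the observed image and LiDAR point cloud of frame \(i\), and \(\lambda_{\mathrm{img}}, \lambda_{\mathrm{pc}}, \eta\) are hypothetical weights and step size. The intuition matching the abstract is that the geometric term constrains poses where textures are ambiguous, while the photometric term constrains them where LiDAR returns are sparse.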
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 16949