Keywords: Multimodal, Pose-free, Novel View Synthesis
Abstract: Pose-free Neural Radiance Fields (NeRF) aim at novel view synthesis (NVS) without relying on accurate poses, offering significant practical value. Images and LiDAR point clouds are two pivotal modalities in autonomous driving scenarios. While demonstrating impressive performance, single-modality pose-free NeRFs often suffer from local optima, owing to the limited geometric information in dense image textures or the sparse, textureless nature of point clouds. Although prior methods have explored the complementary strengths of both modalities, they leverage inherently sparse point clouds only for discrete, non-pixel-wise depth supervision and are limited to NVS of images. As a result, a multimodal unified pose-free framework remains notably absent. In light of this, we propose MUP, a pose-free framework for joint LiDAR-Camera NVS in large-scale scenes. This unified framework enables continuous depth supervision for image reconstruction via LiDAR-Fields rather than discrete point clouds. With multimodal inputs, pose optimization receives gradients from the rendering losses of both point cloud geometry and image texture, thereby alleviating the local optima commonly encountered in single-modality pose-free tasks. Moreover, to further guide NeRF pose optimization, we propose a multimodal geometric optimizer that leverages geometric relations from point clouds and photometric regularization from adjacent image frames. In addition, to alleviate the domain gap between modalities, we propose a multimodal-specific coarse-to-fine training approach for unified, compact reconstruction. Extensive experiments on the KITTI-360 and NuScenes datasets demonstrate the superiority of MUP in achieving geometry-aware, modality-consistent, pose-free 3D reconstruction.
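As a rough illustration of the multimodal pose-optimization idea described in the abstract (not the paper's actual formulation; all symbols and weights here are assumptions), the per-frame pose T_i could be updated by gradients flowing from both a photometric rendering loss and a point-cloud geometry rendering loss:

\[
\mathcal{L}(\Theta, \{T_i\}) \;=\; \lambda_{\mathrm{img}}\, \mathcal{L}_{\mathrm{photo}}\!\big(\hat{I}(\Theta, T_i),\, I_i\big) \;+\; \lambda_{\mathrm{pc}}\, \mathcal{L}_{\mathrm{geo}}\!\big(\hat{D}(\Theta, T_i),\, P_i\big), \qquad T_i \leftarrow T_i - \eta\, \nabla_{T_i} \mathcal{L},
\]

where \(\Theta\) denotes the radiance-field parameters, \(\hat{I}\) and \(\hat{D}\) are rendered images and depths, \(I_i\) and \(P_i\) are the observed image and LiDAR point cloud of frame \(i\), and \(\lambda_{\mathrm{img}}, \lambda_{\mathrm{pc}}, \eta\) are hypothetical weights and step size. The intuition matching the abstract is that the geometric term constrains poses where textures are ambiguous, while the photometric term constrains them where LiDAR returns are sparse.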
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 16949