How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose IAL, a new multi-modal framework that aligns and fuses image and LiDAR information at the data, feature, and mask levels to improve 3D panoptic segmentation, achieving state-of-the-art results.
Abstract: LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose **I**mage-**A**ssists-**L**iDAR (**IAL**), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder’s ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available at https://github.com/IMPL-Lab/IAL.git.
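To make the fusion idea in the abstract concrete, below is a minimal, illustrative sketch (not the authors' implementation) of one way per-point LiDAR features can be fused with image features sampled at the points' camera projections before being fed to a transformer decoder, in the spirit of the Geometric-guided Token Fusion module. All class names, tensor shapes, and the projection convention are assumptions made for illustration only; the actual GTF and PQG modules are defined in the paper and the linked repository.

```python
# Illustrative sketch only: fuse LiDAR point features with image features
# gathered at the points' projected pixel locations, producing tokens for a
# downstream transformer decoder. Names and shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricTokenFusion(nn.Module):
    """Stand-in for a geometric-guided token fusion step (not the paper's GTF)."""

    def __init__(self, lidar_dim: int, img_dim: int, token_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(lidar_dim + img_dim, token_dim),
            nn.ReLU(inplace=True),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, lidar_feats, img_feats, uv_norm):
        # lidar_feats: (B, P, C_l) per-point LiDAR features
        # img_feats:   (B, C_i, H, W) image feature map
        # uv_norm:     (B, P, 2) projected pixel coords, normalized to [-1, 1]
        grid = uv_norm.unsqueeze(1)                        # (B, 1, P, 2)
        sampled = F.grid_sample(img_feats, grid,
                                align_corners=False)       # (B, C_i, 1, P)
        sampled = sampled.squeeze(2).transpose(1, 2)       # (B, P, C_i)
        fused = torch.cat([lidar_feats, sampled], dim=-1)  # (B, P, C_l + C_i)
        return self.mlp(fused)                             # (B, P, token_dim)


if __name__ == "__main__":
    B, P, H, W = 2, 1024, 64, 128
    fusion = GeometricTokenFusion(lidar_dim=64, img_dim=256, token_dim=128)
    tokens = fusion(torch.randn(B, P, 64),
                    torch.randn(B, 256, H, W),
                    torch.rand(B, P, 2) * 2 - 1)
    print(tokens.shape)  # torch.Size([2, 1024, 128])
```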
Lay Summary: Autonomous vehicles, such as self-driving cars, need to understand everything around them to drive safely. This research helps these vehicles perceive their environment more clearly by combining camera images with LiDAR, a laser-based 3D sensor. Cameras provide rich details and color information, while LiDAR offers precise distance measurements and object outlines. Leveraging the complementary strengths of these two sensors, we show how images can be aligned with LiDAR data and used to enhance it, so that every object and surface in a scene is identified, a task known as 3D panoptic segmentation. We developed a system called **Image-Assists-LiDAR (IAL)** that integrates information from both sensors in a top-down manner: it starts with data alignment, moves on to feature fusion, and ends with mask generation. First, we make sure that camera and LiDAR data remain well matched and diverse during training. Second, we combine them into a unified representation that is both accurate and comprehensive. Finally, the system uses cues from both sensors to better locate objects, helping it detect objects that a single sensor might miss. This dual-perspective approach leads to safer, more reliable perception in real-world applications, since the machine is far less likely to misidentify or overlook important details.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/IMPL-Lab/IAL
Primary Area: Applications->Computer Vision
Keywords: LiDAR and Image Perception, Multi-sensor, Panoptic Segmentation
Submission Number: 5839