PointRep: Multi-representation Fusion Enhancement for Multi-modal 3D Object Detection

Published: 2025 · Last Modified: 05 Jan 2026 · RCAR 2025 · CC BY-SA 4.0
Abstract: Multi-modal fusion has great potential to enhance 3D object detection by leveraging the complementary information provided by different modalities such as images and LiDAR point clouds. Mainstream sequential fusion methods extract semantic features from images and project them onto point clouds for data-level fusion; they are simple to implement and have low computational complexity. However, the accuracy and richness of their fused representations are limited, yielding only modest performance gains in challenging scenarios such as long-range and dense object detection. We propose a multi-representation fusion enhancement method named PointRep. Specifically, the multi-representation comprises a Semantic Representation (SemRep) and an Instance Representation (InsRep). SemRep combines image semantics with point-cloud shape features to predict fine-grained semantic descriptions, while InsRep extracts the center point in the top-down (bird's-eye) view as the unique instance representation for each target. PointRep can be applied to any point-based 3D object detection model and significantly improves its performance. We perform comprehensive experiments on the KITTI dataset, where our method achieves state-of-the-art performance compared with existing sequential fusion methods. In particular, for long-range detection between 50 and 100 meters we achieve a performance improvement of 10.91%, and for crowded-object detection a 3.5% improvement.
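To make the two representations concrete, the following is a minimal PyTorch sketch of how SemRep and InsRep could be computed, based only on the abstract's description. It is not the authors' implementation: the module names, feature dimensions, MLP structure, and the mean-centroid pooling used for the center point are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemRep(nn.Module):
    """Hypothetical sketch: fuse per-point image semantics with point-cloud
    shape features to predict a fine-grained semantic description per point."""
    def __init__(self, img_dim=64, shape_dim=64, num_classes=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + shape_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, img_feats, shape_feats):
        # img_feats:   (N, img_dim)   image semantics projected onto points
        # shape_feats: (N, shape_dim) geometric features from the point cloud
        return self.mlp(torch.cat([img_feats, shape_feats], dim=-1))

class InsRep(nn.Module):
    """Hypothetical sketch: pool each candidate target's points in the
    top-down (bird's-eye) view into a single center-point representation."""
    def forward(self, points, instance_ids):
        # points:       (N, 3) xyz coordinates
        # instance_ids: (N,)   candidate instance assignment per point
        centers = []
        for i in instance_ids.unique():
            mask = instance_ids == i
            xy = points[mask, :2].mean(dim=0)   # BEV centroid (x, y)
            z = points[mask, 2].mean()          # average height
            centers.append(torch.cat([xy, z.unsqueeze(0)]))
        return torch.stack(centers)             # (num_instances, 3)

# Usage: enrich raw points with both representations before feeding
# them to a point-based 3D detector.
N = 1024
points = torch.randn(N, 3)
sem = SemRep()(torch.randn(N, 64), torch.randn(N, 64))  # (N, num_classes)
ins = InsRep()(points, torch.randint(0, 8, (N,)))       # one center per instance
```

Under this reading, the semantic and instance representations are attached to the point cloud as extra per-point and per-candidate channels, which is what would let the method plug into any point-based detector without architectural changes.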