PAPNet: Point-Enhanced Attention-Aware Pillar Network for 3D Object Detection in Autonomous Driving

Ruitong Li, Yuenan Zhao, Xiaoyu Xu, Jiaming Chen, Ran Song, Wei Zhang

Published: 01 Jan 2026, Last Modified: 28 Feb 2026. Venue: IEEE Transactions on Automation Science and Engineering. License: CC BY-SA 4.0

Abstract: The conversion of raw point clouds into pillar representations has been widely adopted for 3D object detection. This conversion discretizes a point cloud into structured grids, which enables more efficient spatial representation and faster processing in real-time autonomous driving systems. However, discretizing raw point clouds often leads to missed detections of small objects such as pedestrians and cyclists, because the discretization inevitably loses contextual and multi-resolution information present in the raw point clouds. To address this issue, we propose PAPNet, a point-enhanced attention-aware pillar network mainly composed of a point-pillar cross-attention module (PCM), a pillar-wise dual attention module (PDAM), and a multi-resolution set abstraction module (MSAM). PCM integrates raw point cloud features with pillar features across different dimensions, and PDAM guides PAPNet to focus on the intrinsic characteristics of the pillars. Additionally, MSAM retains both high-resolution and low-resolution features while integrating multi-scale information. Extensive experiments on four public datasets and in real-world scenarios demonstrate the effectiveness and efficiency of PAPNet. Code, data, and demo videos can be found at the project website: https://vsislab.github.io/PAPNet/

Note to Practitioners: This research addresses the tendency of existing pillar-based methods to overlook small objects such as pedestrians and cyclists in autonomous driving. The approach applies advanced deep learning techniques, achieving higher 3D object detection accuracy with shorter processing time than most state-of-the-art methods. Specifically, the proposed method leverages a feature network to encode information from raw point clouds for object detection. Additionally, it takes into account features across multiple resolutions. The method has been validated through experiments on four public datasets and in real-world scenarios.
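
The pillar discretization that the abstract refers to can be illustrated with a minimal sketch. This is not PAPNet's implementation: the function names `pillarize` and `pillar_mean_features`, the grid ranges, and the pillar size are all hypothetical choices made for illustration. The sketch only shows how points are binned into vertical columns on an x-y grid and how a simple per-pillar summary discards the individual point positions, which is the kind of information loss the abstract describes.

```python
import numpy as np


def pillarize(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0), pillar_size=1.0):
    """Group (N, 3) points into vertical pillars on an x-y grid.

    Hypothetical helper for illustration only. Returns a dict that maps
    (ix, iy) grid indices to an (M, 3) array of the points in that pillar.
    """
    points = np.asarray(points, dtype=np.float64)

    # Drop points that fall outside the grid extent.
    inside = (
        (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1])
        & (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
    )
    kept = points[inside]

    # Integer grid coordinates of each remaining point (z is ignored,
    # so each cell is a column of unbounded height, i.e. a pillar).
    ix = np.floor((kept[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    iy = np.floor((kept[:, 1] - y_range[0]) / pillar_size).astype(np.int64)

    pillars = {}
    for i, j, p in zip(ix, iy, kept):
        pillars.setdefault((int(i), int(j)), []).append(p)
    return {key: np.stack(pts) for key, pts in pillars.items()}


def pillar_mean_features(pillars):
    """Summarize each pillar by the mean of its points (an assumed, very
    simple stand-in for a learned pillar encoder)."""
    return {key: pts.mean(axis=0) for key, pts in pillars.items()}


if __name__ == "__main__":
    cloud = [
        [0.1, 0.2, 0.0],
        [0.4, 0.9, 1.0],
        [2.5, 3.5, 0.5],
        [9.0, 9.0, 0.0],  # outside the grid, so it is discarded
    ]
    pillars = pillarize(cloud)
    for key, feat in sorted(pillar_mean_features(pillars).items()):
        print(key, len(pillars[key]), feat)
```

In this toy example, the two points in pillar (0, 0) collapse into a single mean vector, so their relative layout is no longer recoverable from the pillar feature alone. A small object covered by only a handful of points is affected most by this kind of summarization.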

External IDs: doi:10.1109/tase.2026.3653431