Abstract: Point cloud data has recently attracted the attention of researchers as a promising data representation for a wide range of applications. Unlike 2D data, point clouds are unordered, irregular, and often large in scale, which poses severe challenges for the design of deep learning models. Over the past decade, substantial progress has been made on architectures that address permutation invariance, geometric reasoning, scalability, and robustness, leading to rapid expansion across diverse 3D-data-oriented applications. This paper presents a comprehensive survey of the existing literature and analyzes how different 3D representations have shaped the design and performance of deep learning models. In contrast to prior surveys that have emphasized limited task subsets or specific model families, this survey reviews deep point cloud models from a representation- and architecture-centric perspective. Beyond (1) core tasks such as classification, segmentation, detection, and tracking, this survey systematically examines recent progress in broader directions, including (2) geometric modeling, alignment, and pose estimation, (3) foundation models and scene understanding, and (4) robustness, generalization, and reliability. Furthermore, this survey presents commonly used datasets and evaluation metrics, and finally summarizes challenges and future directions toward robustness, efficiency, and generalizability of 3D point cloud systems.
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=0hHrJeeeEG
Changes Since Last Submission: Dear Action Editor, Thank you for the constructive comments and suggestions. We have substantially revised the manuscript to improve the survey’s clarity, focus, and organization by clarifying its central theme, restructuring the content into coherent task families, adding a discussion of point cloud representations and architectures, removing loosely related topics, and revising the evaluation section. We believe the revised manuscript is more structured, focused, and useful.
Comment 1: “This survey seems to be a mish-mash of a lot of things but doesn’t have a central theme. It’s hard to understand how a reader will gain additional insights by reading this survey. It touches upon so many topics, but didn’t really devote enough time to clearly depict any of them.”
Response: We thank the Action Editor for this helpful comment. We agree that the previous version did not make the central theme and organization of the survey sufficiently clear. In the revised manuscript, we clarified that the survey aims to provide a comprehensive yet structured overview of deep learning methods for 3D point clouds. The paper is now organized around coherent task families and a representation- and architecture-centric perspective, helping readers better understand the relationships among tasks, models, and design choices.
Changes: We reorganized the manuscript into coherent task families rather than presenting many topics as independent sections. The revised structure now includes: 1) core tasks; 2) geometric modeling, alignment, and pose estimation; 3) foundation models and scene understanding; 4) robustness and generalization; 5) evaluation and benchmarking; and 6) open challenges and conclusions. We also added a dedicated section on point cloud representations and architectural paradigms, covering point-based, voxel-based, projection-based, graph-based, token-based, foundation-model-based, and hybrid representations, as well as major architectures such as MLP-based, convolution-based, graph neural network, transformer-based, diffusion-based, multimodal, and foundation-model architectures. In addition, we added short introductory paragraphs to the major task-family sections to explain the rationale behind each grouping and revised the conclusion, challenges, and future directions to better summarize current limitations and promising research directions.
Comment 2: “Some topics have hardly any relationship with point cloud deep learning such as SfM or even 3D tracking.”
Response: We agree that some topics in the previous version, such as SfM and parts of 3D tracking, were not sufficiently connected to deep point cloud learning. These topics were originally included as background, but we recognize that they should remain only when directly relevant to point cloud data, representations, or learning-based methods.
Changes: We revised the manuscript to better focus on point cloud deep learning. Sections mainly related to classical or general 3D vision pipelines were removed or reduced, and only brief contextual references were retained where they help explain the motivation or development of learning-based point cloud methods.
Comment 3: “A result table was shown, but results are first incomplete, and then presented with different metrics for different methods. This haphazard approach makes it difficult for the reader to gain any insight.”
Response: We agree that the previous result table was unclear because it mixed incomplete results and different metrics across methods. In the revised manuscript, we limited numerical comparisons to the core tasks where standard benchmarks and metrics allow more meaningful comparison, and we organized the results separately by task and dataset.
For example, in 3D detection, the most common benchmarks are KITTI, nuScenes, and Waymo. Of the 40 recent detection papers we reviewed, 13 reported results on KITTI, 25 on nuScenes, and 29 on Waymo. Since nuScenes and Waymo were the most frequently used datasets, we report detection results mainly on these two benchmarks. We did not include KITTI in the main table because doing so made the table sparse and introduced incompatible settings, such as different difficulty levels, object classes, and evaluation protocols.
Changes: We revised the evaluation and benchmarking section to clearly explain why performance analysis is limited to classification, segmentation, and detection, and why specific datasets and metrics were selected. These tasks were chosen because they have widely used benchmarks and commonly reported compatible metrics, which allow more meaningful comparison. We also clarified that all numerical values are taken from the original papers rather than reproduced through re-training, since reproducing all methods is impractical due to missing code, different preprocessing pipelines, hyperparameter settings, training protocols, hardware requirements, and high computational cost.
Assigned Action Editor: ~Stephen_Lin1
Submission Number: 9616