How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: efficient transformer, token reduction, 3D point cloud transformer
TL;DR: We identify an over-tokenization phenomenon in 3D point cloud transformers.
Abstract: Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present an efficient token merging strategy that drastically reduces the token count by up to 90–95% while preserving competitive performance. Our approach estimates token importance by leveraging spatial structures within the 3D point cloud, enabling aggressive token reduction with minimal degradation in accuracy. This finding challenges the prevailing assumption that more tokens inherently yield better performance, and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. We will release code and detailed benchmarks to support reproducibility and further system-level exploration of efficient foundation models for 3D data.
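The abstract describes importance-based token merging only at a high level, so the following is a minimal, hypothetical sketch of the general idea, not the paper's method: score each point token by a spatial heuristic (here, local k-NN geometric variance, an assumption), keep the top tokens, and merge the rest into their nearest kept token by feature averaging. The function name `merge_tokens`, the scoring rule, and the merge rule are all illustrative choices.

```python
import torch


def merge_tokens(xyz: torch.Tensor, feats: torch.Tensor,
                 keep_ratio: float = 0.1, k: int = 8):
    """Hypothetical importance-based token merging sketch.

    xyz: (N, 3) token coordinates; feats: (N, C) token features.
    Returns roughly keep_ratio * N merged tokens.
    """
    n = xyz.shape[0]
    n_keep = max(1, int(n * keep_ratio))

    # Assumed importance score: spatial variance of each token's k nearest
    # neighbours (flat/uniform regions score low, geometrically rich ones high).
    dists = torch.cdist(xyz, xyz)                     # (N, N) pairwise distances
    _, knn_i = dists.topk(k + 1, largest=False)       # self is included at index 0
    neighbors = xyz[knn_i[:, 1:]]                     # (N, k, 3)
    importance = neighbors.var(dim=1).sum(dim=-1)     # (N,)

    keep_idx = importance.topk(n_keep).indices
    drop_mask = torch.ones(n, dtype=torch.bool)
    drop_mask[keep_idx] = False

    kept_xyz, kept_feats = xyz[keep_idx], feats[keep_idx]
    if drop_mask.any():
        # Merge: assign each dropped token to its nearest kept token and
        # average features, so dropped tokens are folded in, not discarded.
        assign = torch.cdist(xyz[drop_mask], kept_xyz).argmin(dim=1)
        counts = torch.ones(n_keep, 1)
        kept_feats = kept_feats.index_add(0, assign, feats[drop_mask])
        counts = counts.index_add(0, assign, torch.ones(assign.shape[0], 1))
        kept_feats = kept_feats / counts
    return kept_xyz, kept_feats


# Usage: reduce 4096 point tokens to ~10% before the transformer blocks.
xyz = torch.rand(4096, 3)
feats = torch.rand(4096, 64)
m_xyz, m_feats = merge_tokens(xyz, feats, keep_ratio=0.1)
print(m_xyz.shape, m_feats.shape)  # torch.Size([409, 3]) torch.Size([409, 64])
```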
Submission Number: 118