Keywords: point cloud representation, token merging, transformer models
TL;DR: Optimizing 3D point cloud transformers for large-scale processing via token merging
Abstract: Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we show that these tokens are remarkably redundant, leading to substantial inefficiency. We introduce an efficient token merging method and demonstrate that it can reduce the token count by up to 90–95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io.
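To make the idea of token merging concrete, below is a minimal, illustrative sketch of generic similarity-based token merging applied to point-cloud tokens. It is not the paper's method: the `merge_tokens` function, the `keep_ratio` parameter, and the greedy nearest-neighbour heuristic are all assumptions introduced here for exposition only.

```python
# Illustrative sketch only: generic similarity-based token merging for a
# batch of point-cloud tokens. NOT the paper's method; the function name,
# keep_ratio, and the matching heuristic are assumptions for illustration.
import torch


def merge_tokens(x: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Merge similar tokens so that roughly `keep_ratio` of them remain.

    x: (B, N, C) token features for a batch of point clouds.
    Returns: (B, M, C) merged token features, with M ~= keep_ratio * N.
    """
    B, N, C = x.shape
    n_keep = max(1, int(N * keep_ratio))

    # Cosine similarity between every pair of tokens.
    feat = torch.nn.functional.normalize(x, dim=-1)
    sim = feat @ feat.transpose(1, 2)                   # (B, N, N)
    sim.diagonal(dim1=1, dim2=2).fill_(-float("inf"))   # ignore self-similarity

    # Keep the tokens that are least similar to their nearest neighbour
    # (a simple stand-in heuristic for choosing merge centers).
    max_sim, _ = sim.max(dim=-1)                        # (B, N)
    keep_idx = max_sim.argsort(dim=-1)[:, :n_keep]      # (B, M)

    # Assign every token to its most similar kept center and average.
    merged = torch.zeros(B, n_keep, C, device=x.device, dtype=x.dtype)
    counts = torch.zeros(B, n_keep, 1, device=x.device, dtype=x.dtype)
    for b in range(B):
        centers = feat[b, keep_idx[b]]                  # (M, C)
        assign = (feat[b] @ centers.T).argmax(dim=-1)   # (N,) center index per token
        merged[b].index_add_(0, assign, x[b])
        counts[b].index_add_(0, assign, torch.ones(N, 1, device=x.device, dtype=x.dtype))
    return merged / counts.clamp(min=1)


# Example: reduce 4096 point tokens to ~10% before the transformer blocks.
tokens = torch.randn(2, 4096, 384)
print(merge_tokens(tokens, keep_ratio=0.1).shape)  # torch.Size([2, 409, 384])
```

This sketch only conveys the general mechanism (identify redundant tokens by feature similarity and pool them); the paper's actual merging strategy and reduction schedule are described in the full text and released code.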
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 23675