MuSeLLM: SDF Generation and Understanding via Multi-Scale Tokenization with Position-Aware Guidance

Published: 01 Jan 2025 · Last Modified: 12 Nov 2025 · ICMR 2025 · CC BY-SA 4.0
Abstract: Advancing 3D generation and understanding is a critical step toward machines that can interpret and interact with the physical world. Large language models, trained on internet-scale text corpora, are widely recognized to possess commonsense knowledge of the real world and a degree of perception in 3D space. However, leveraging these abilities for 3D understanding and generation is hindered by a significant gap in data format between the two domains. To bridge this gap, we propose MuSeLLM, an adaptation strategy for finetuning LLMs on a deliberately chosen data format, voxelized SDFs, selected for its ease of tokenization. On top of this representation, we introduce a multi-scale tokenization strategy that enables a coarse-to-fine generation paradigm over different levels of codebook tokens. This is a natural fit for prevailing LLM architectures, where the generation of tokens at a given level can attend to the complete shape representation at all coarser levels. To avoid overfitting to the spatial order of tokens, we further propose position-aware guidance, which perturbs the generation order of tokens within each level during training and thus serves as a strong data augmentation strategy for adapting LLMs to the 3D domain when little 3D data is available. Experimental results on text-guided 3D generation and 3D object understanding show that our method outperforms previous state-of-the-art approaches trained on the same data, indicating that our design triggers the inherent spatial reasoning ability of LLMs.
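The core of the position-aware guidance described above can be sketched as follows. This is an illustrative reconstruction based only on the abstract, not the authors' released code; the function names `perturb_level_order` and `build_training_sequence` are hypothetical, and the exact token pairing and level layout in MuSeLLM may differ.

```python
import random

def perturb_level_order(level_tokens, seed=None):
    """Shuffle the generation order of one level's tokens while keeping
    each token paired with its original spatial position index, so the
    sequence stays position-aware after permutation (assumed mechanism)."""
    rng = random.Random(seed)
    pairs = list(enumerate(level_tokens))  # (position_id, token)
    rng.shuffle(pairs)
    return pairs

def build_training_sequence(levels, seed=0):
    """Concatenate levels coarse-to-fine; only the intra-level order is
    perturbed, so tokens at a given level still follow every coarser level."""
    seq = []
    for depth, tokens in enumerate(levels):
        for pos, tok in perturb_level_order(tokens, seed=seed + depth):
            seq.append((depth, pos, tok))  # (level, spatial position, token)
    return seq
```

Under this reading, the permutation acts as data augmentation: each epoch can present the same shape with a different intra-level token order, while the attached position ids preserve the spatial information the model must learn.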