ShapeGPT: 3D Shape Generation With a Unified Multi-Modal Language Model

Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Wen Liu, Gang Yu, Tao Chen

Published: 01 Jan 2025, Last Modified: 05 Nov 2025, IEEE Transactions on Multimedia, CC BY-SA 4.0
Abstract: The advent of large language models, which enable flexibility through instruction-driven approaches, has revolutionized many traditional generative tasks, but large models for 3D data, particularly ones that comprehensively handle 3D shapes together with other modalities, remain under-explored. By achieving instruction-based shape generation, versatile multi-modal generative shape models can significantly benefit various fields, such as 3D virtual construction and network-aided design. In this article, we present ShapeGPT, a shape-included multi-modal framework that leverages strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a “word-sentence-paragraph” framework that discretizes continuous shapes into shape words, assembles these words into shape sentences, and integrates shape sentences with instructional text into multi-modal paragraphs. To learn this shape-language model, we use a three-stage training scheme, comprising shape representation, multi-modal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.
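The abstract's "shape word" step suggests a VQ-VAE-style tokenization: continuous shape features are snapped to their nearest entry in a learned codebook, so a shape becomes a sequence of discrete token ids that a language model can consume alongside text. The following is a minimal PyTorch sketch of that idea only; the class name, codebook size, and feature dimension are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ShapeWordQuantizer(nn.Module):
    """Sketch of discretizing continuous shape features into 'shape words'
    via nearest-neighbor lookup in a learned codebook (VQ-VAE style).
    num_words and dim are assumed values for illustration."""

    def __init__(self, num_words: int = 512, dim: int = 256):
        super().__init__()
        # Codebook of "shape words": each row is one discrete shape token.
        self.codebook = nn.Embedding(num_words, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, dim) continuous features from a shape encoder.
        # L2 distance between every feature and every codebook entry.
        book = self.codebook.weight.unsqueeze(0).expand(feats.size(0), -1, -1)
        dists = torch.cdist(feats, book)  # (batch, seq_len, num_words)
        # Replace each feature with the index of its nearest shape word,
        # yielding a "shape sentence" of discrete token ids.
        return dists.argmin(dim=-1)

# Usage: tokenize a batch of 2 shapes, each encoded as 64 feature vectors.
quantizer = ShapeWordQuantizer()
features = torch.randn(2, 64, 256)
shape_sentence = quantizer(features)  # (2, 64) integer token ids
print(shape_sentence.shape)

These token ids would then be interleaved with text tokens to form the multi-modal paragraphs the abstract describes; the actual encoder, codebook training, and alignment stages are beyond this sketch.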