Towards Unified Representation of Multi-Modal Pre-training for 3D Processing

Ben Fei, Yixuan Li, Weidong Yang, Lipeng Ma, Ying He

Published: 01 Jan 2025 · Last Modified: 21 Jan 2026 · IEEE Transactions on Visualization and Computer Graphics · CC BY-SA 4.0
Abstract: With growing demand from real-world applications, learning to understand 3D data has become increasingly critical for a variety of computer graphics tasks, including shape classification, model retrieval, scene reconstruction, and point cloud completion. While previous methods have explored self-supervised learning within single modalities, such as point clouds or images, the potential of multi-modal supervision remains underexplored due to the lack of aligned and scalable training signals. In this work, we present DR-Point, a tri-modal pre-training framework that jointly learns from RGB images, depth maps, and 3D point clouds to construct a unified embedding space across modalities. By leveraging cross-modal consistency among modality triplets, DR-Point enables effective alignment of 2D-3D features without requiring manual annotations. To further enhance geometric fidelity, we incorporate a differentiable rendering module that synthesizes depth information and refines structural details in reconstructed point clouds. This promotes improved representation of fine-grained surface geometry and spatial correspondence. Extensive experiments across multiple downstream benchmarks demonstrate that DR-Point consistently outperforms existing self-supervised baselines on tasks such as 3D object classification, part segmentation, semantic segmentation, and shape completion. Our results highlight the effectiveness of multi-modal pre-training for comprehensive 3D processing and its potential to benefit a wider range of graphics-related tasks.
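The abstract describes aligning RGB image, depth map, and point cloud embeddings in a shared space via cross-modal consistency over modality triplets. Below is a minimal sketch (not the authors' code) of one common way to realize such tri-modal alignment with pairwise contrastive (InfoNCE) losses; the encoder outputs, embedding dimension, temperature, and symmetric-loss formulation are assumptions for illustration only.

```python
# Minimal sketch of tri-modal contrastive alignment across image, depth, and
# point cloud embeddings. All hyperparameters and shapes are hypothetical.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def tri_modal_loss(img_emb: torch.Tensor,
                   depth_emb: torch.Tensor,
                   point_emb: torch.Tensor) -> torch.Tensor:
    """Sum of pairwise alignment losses over the (image, depth, point) triplet."""
    return (info_nce(img_emb, point_emb) +
            info_nce(depth_emb, point_emb) +
            info_nce(img_emb, depth_emb))


if __name__ == "__main__":
    B, D = 8, 256                                     # hypothetical batch size / embedding dim
    img_emb = torch.randn(B, D)                       # stand-ins for modality encoder outputs
    depth_emb = torch.randn(B, D)
    point_emb = torch.randn(B, D)
    print(tri_modal_loss(img_emb, depth_emb, point_emb).item())
```

In such a setup, each pairwise loss pulls corresponding samples from two modalities together while pushing apart non-matching pairs in the batch; summing over all three pairs is one straightforward way to enforce consistency across the full triplet.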