Learning Robust 3D Representation from CLIP via Dual Denoising

Published: 01 Jul 2025, Last Modified: 09 Jul 2025 (ICML 2025 R2-FM Workshop Poster, CC BY 4.0)
Keywords: 3D pre-training, zero-shot adversarial robustness
TL;DR: Distilling 3D representations with zero-shot adversarial robustness from a pre-trained CLIP model
Abstract: In this paper, we explore a critical yet under-investigated challenge: whether pre-trained vision-language models like CLIP can be adapted for zero-shot, adversarially robust point cloud recognition. Since point clouds are a crucial data format for representing the 3D world, particularly in safety-critical applications, there is a pressing need for adversarially robust 3D recognition algorithms, given the inherent vulnerability of deep models to adversarial attacks. Recent advances in vision-language pre-training have endowed point cloud recognition models with powerful zero-shot generalization capacity, leading to a new paradigm for large-scale 3D recognition. This is usually achieved via cross-modal distillation, a scalable approach for multi-modal-aware 3D learning. However, current methods primarily rely on direct alignment to map point cloud features into a shared multi-modal feature space, providing no improvement in 3D robustness. This raises a critical question: can both high-performing zero-shot 3D recognition and zero-shot 3D adversarial robustness be achieved in large-scale 3D learning? Our answer is affirmative. In this paper, we propose a novel distillation algorithm designed to learn robust 3D representations from CLIP. It simultaneously improves both zero-shot 3D recognition performance and zero-shot 3D adversarial robustness over baseline models. Our approach is built upon two key components, namely robust 3D pre-training and parallel feature denoising. This enables robust and high-performing 3D zero-shot generalization without relying on adversarial training, which is often inefficient and prone to overfitting. Experiments indicate that our model achieves a 7% improvement on clean input and varying degrees of improvement on perturbed input, outperforming other models of similar scale on zero-shot 3D recognition benchmarks.
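For context, the direct-alignment cross-modal distillation paradigm that the abstract contrasts against can be summarized with a minimal sketch. The following PyTorch snippet is a hypothetical illustration, not the paper's dual-denoising method: the function name `distill_loss` and the tensor arguments are assumptions, and it simply pulls a point-cloud encoder's embeddings toward frozen CLIP image and text embeddings of the same objects via cosine alignment.

```python
import torch
import torch.nn.functional as F


def distill_loss(point_feats: torch.Tensor,
                 clip_image_feats: torch.Tensor,
                 clip_text_feats: torch.Tensor,
                 w_img: float = 1.0,
                 w_txt: float = 1.0) -> torch.Tensor:
    """Direct-alignment distillation (illustrative sketch).

    All inputs are [B, D] feature batches: point_feats come from a trainable
    point-cloud encoder; clip_image_feats / clip_text_feats are precomputed
    embeddings from a frozen CLIP model for the same B objects.
    """
    p = F.normalize(point_feats, dim=-1)
    i = F.normalize(clip_image_feats, dim=-1)
    t = F.normalize(clip_text_feats, dim=-1)

    # 1 - cosine similarity between the point embedding and each CLIP modality
    loss_img = (1.0 - (p * i).sum(dim=-1)).mean()
    loss_txt = (1.0 - (p * t).sum(dim=-1)).mean()
    return w_img * loss_img + w_txt * loss_txt
```

As the abstract notes, aligning features this way transfers CLIP's zero-shot recognition ability to point clouds but by itself offers no adversarial robustness; the paper's contribution is to augment this distillation setup with robust 3D pre-training and parallel feature denoising.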
Submission Number: 58