CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation
Abstract: Vision foundation models have revolutionized 2D camera-based perception by extracting generalized features for downstream tasks. Recent work applies self-supervised cross-modal knowledge distillation (KD) to transfer these capabilities to 3D LiDAR models, but often relies on complex losses, pseudo-semantic maps, or limits KD to semantic segmentation. We introduce CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework with simple yet effective design choices. Our method uses a direct feature similarity loss and an MLP projection head to capture complex semantic dependencies without relying on pseudo-semantic maps or explicit semantic supervision. Additionally, we enhance the learned knowledge with a self-supervised occupancy prediction task, further improving 3D spatial reasoning. Experiments on autonomous driving benchmarks show that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection, with up to 10% mIoU improvement, particularly when fine-tuning with limited data. Moreover, models pretrained with our approach show strong robustness to weather and sensor corruptions as well as good domain generalization capabilities.
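To make the distillation objective concrete, below is a minimal sketch (not the authors' code) of the setup the abstract describes: 3D backbone features are passed through an MLP projection head and matched against frozen 2D foundation-model features with a direct feature similarity loss. All names, dimensions, and the specific choice of cosine similarity are illustrative assumptions.

```python
# Hedged sketch of cross-modal 2D-to-3D feature distillation.
# Assumed setup: LiDAR points are projected into the camera image,
# so each 3D point feature has a paired 2D teacher feature.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """MLP head mapping 3D student features into the 2D teacher feature space."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


def feature_similarity_loss(student: torch.Tensor,
                            teacher: torch.Tensor) -> torch.Tensor:
    """Direct feature similarity loss (cosine variant, an assumption).

    student: (N, D) projected 3D point features.
    teacher: (N, D) frozen 2D foundation-model features sampled at the
             image locations where the N LiDAR points project.
    """
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()


# Usage with random stand-in tensors (all sizes hypothetical):
head = ProjectionHead(in_dim=96, hidden_dim=256, out_dim=384)
student_feats = torch.randn(1024, 96)    # 3D backbone output per point
teacher_feats = torch.randn(1024, 384)   # frozen 2D teacher features
loss = feature_similarity_loss(head(student_feats), teacher_feats.detach())
loss.backward()
```

Detaching the teacher features reflects the standard KD setting in which the 2D foundation model stays frozen and only the 3D student and its projection head receive gradients.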