Keywords: Computer vision, Point-cloud, Cross-modality.
Abstract: 3D point-clouds and 2D images are different visual representations of the physical world. While human vision can understand both representations, computer vision models designed for 2D image and 3D point-cloud understanding are quite different.
Our paper explores the potential for transferring between these two representations by empirically investigating whether the transfer is feasible, what benefits it brings, and why it works.
We discover that the same architecture and pretrained weights of a neural network can indeed be used to understand both images and point-clouds. Specifically, we transfer a pretrained image model to a point-cloud model by \textit{inflating} its 2D convolutional filters to 3D and then \textbf{f}inetuning the \textbf{i}mage-\textbf{p}retrained model (FIP).
Surprisingly, models finetuned minimally --- only on the input, output, and optionally the batch-normalization layers --- achieve competitive performance on 3D point-cloud classification, beating a wide range of point-cloud models that adopt task-specific architectures and a variety of tricks. Finetuning the whole model improves performance further and significantly. We also find that FIP improves data efficiency, achieving up to a 10.0-point top-1 accuracy gain on few-shot classification, and speeds up the training of point-cloud models by up to 11.1x to reach a target accuracy.
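The inflation step described above can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' released code: a 2D convolutional kernel is replicated along a new depth axis and rescaled by the depth (I3D-style inflation), so that a pretrained 2D filter yields an equivalent response when applied to inputs that are constant along the third dimension. The function name `inflate_conv2d` and the default depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, depth: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution into a 3D convolution by
    replicating its kernel along a new depth axis and dividing by the
    depth, preserving the response on depth-constant inputs."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(depth, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, depth, kH, kW), rescaled by depth
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

After inflating every 2D convolution in the backbone this way, only the input, output, and (optionally) batch-normalization layers need to be finetuned in the minimal setting the abstract describes.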
One-sentence Summary: With minimal fine-tuning efforts, pretrained-image models can be directly used for point-cloud understanding.
Supplementary Material: zip