Keywords: vision-language, pretraining, semantic grounding
TL;DR: We discuss grounding pretrained vision-language features in neural implicit representations.
Abstract: Pretraining deep neural networks has become standard practice and underpins the recent trend of foundation models. For perception, however, pretraining has mostly been confined to 2D feature learning; 3D representation learning has yet to have its breakthrough moment. 3D data is more heterogeneous and harder to come by, 3D learning algorithms still lag behind their 2D counterparts, and the right 3D self-supervised learning objectives have yet to be discovered.
In this paper, we examine a recent trend in 3D representation learning in which features extracted from pretrained 2D image encoders are grounded in a 3D representation through a 3D feature field. We discuss recent results, highlight open problems in the field, and suggest potential avenues for addressing them; a minimal sketch of the grounding idea follows below.
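To make the grounding idea concrete, here is a minimal sketch, not the paper's method: a coordinate MLP serves as the 3D feature field and is distilled by regressing its output at 3D points onto 2D features sampled at those points' image projections (e.g., from a CLIP-style encoder). The names `FeatureField` and `distill_step`, the feature dimension, and the cosine loss are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureField(nn.Module):
    """A 3D feature field: maps xyz coordinates to a d-dimensional feature vector.
    (Hypothetical sketch; real systems often attach this head to a NeRF-like backbone.)"""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, xyz):
        return self.mlp(xyz)

def distill_step(field, optimizer, xyz, target_feats):
    """One distillation step: regress the field's output at 3D points onto
    2D features lifted from a pretrained image encoder at their projections."""
    optimizer.zero_grad()
    pred = field(xyz)
    # Cosine loss is a natural choice since CLIP-style features are
    # typically compared with cosine similarity.
    loss = 1.0 - F.cosine_similarity(pred, target_feats, dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random stand-ins for real data.
field = FeatureField()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
xyz = torch.rand(1024, 3)        # 3D points (e.g., back-projected from depth)
target = torch.randn(1024, 512)  # 2D features sampled at the points' projections
print(distill_step(field, opt, xyz, target))
```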