NeRF as Pretraining at Scale: Generalizable 3D-Aware Semantic Representation Learning from View Prediction
Abstract: Cross-scene generalizable NeRF models, which can directly synthesize novel views of unseen scenes from a few source views, are gaining prominence in the NeRF field. Observing signs of emergent capabilities in existing methods, we draw a parallel between BERT's "drop-and-predict" Masked Language Model (MLM) pretraining and novel view synthesis (NVS) in generalizable NeRF. In this work, we pioneer scaling up NVS as an effective pretraining strategy in a multi-view context. To bolster generalizability during pretraining, we incorporate a large-scale, minimally annotated dataset and proportionally increase the model size, revealing a neural scaling law akin to that observed in BERT. We also introduce hardness-aware training techniques that promote robust feature learning. Our model, named "NPS", demonstrates remarkable generalizability in both zero-shot and few-shot novel view synthesis, and further shows emergent capabilities in downstream tasks such as few-shot multi-view semantic segmentation and depth estimation. Significantly, NPS reduces the need to train separate models for each task, underlining its versatility and efficiency. This approach sets a new precedent in the NeRF field and highlights the possibilities opened up by scaling up generalizable novel view synthesis.