We present a novel framework for generalizable dynamic radiance fields in egocentric views. Our approach predicts a 3D representation of the physical world at a given time from a monocular video, without test-time training. To this end, we use a contracted triplane as the 3D representation of the physical world, viewed egocentrically at a specific time. To update this explicit 3D representation, we propose a 4D-aware transformer module that aggregates features from the monocular video. We also introduce a temporal 3D constraint to achieve better multi-view consistency. In addition, we train the proposed model on large-scale monocular videos in a self-supervised manner. Our model achieves top results in novel view synthesis on dynamic scene datasets, demonstrating a strong understanding of the 4D physical world, and it generalizes well to unseen scenarios. Furthermore, we find that our approach exhibits emergent capabilities for geometry and semantic learning. We hope our approach provides a preliminary understanding of the physical world from a first-person view and facilitates future research in computer vision, computer graphics, and robotics.
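To make the "contracted triplane" representation concrete, below is a minimal sketch (not the authors' code) of how such a representation could be queried: unbounded 3D points are contracted into a bounded cube via a mip-NeRF 360-style scene contraction, then features are bilinearly sampled from three axis-aligned planes and combined. The function names, tensor shapes, and the choice to sum plane features are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def contract(x: torch.Tensor) -> torch.Tensor:
    """Map unbounded points in R^3 into the cube [-1, 1]^3 (illustrative contraction)."""
    norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    # Points with |x| <= 1 are kept; farther points are squashed toward radius 2,
    # then the radius-2 ball is rescaled so coordinates fit in [-1, 1].
    contracted = torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * (x / norm))
    return contracted / 2.0


def query_triplane(planes: torch.Tensor, pts: torch.Tensor) -> torch.Tensor:
    """Sample per-point features from a triplane.

    planes: (3, C, H, W) feature planes for the XY, XZ, and YZ projections.
    pts:    (N, 3) query points in world space.
    returns (N, C) features, combined here by summation.
    """
    p = contract(pts)                                          # (N, 3) in [-1, 1]
    coords = torch.stack(
        [p[:, [0, 1]], p[:, [0, 2]], p[:, [1, 2]]], dim=0
    )                                                          # (3, N, 2) plane coords
    grid = coords.unsqueeze(2)                                 # (3, N, 1, 2) for grid_sample
    feats = F.grid_sample(planes, grid, align_corners=True)    # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)        # (N, C)


# Example: query 1024 random points against 32-channel, 128x128 planes.
planes = torch.randn(3, 32, 128, 128)
pts = torch.randn(1024, 3) * 5.0
features = query_triplane(planes, pts)                         # (1024, 32)
```

In the framework described above, such per-point features would then be decoded into density and color for volume rendering, with the planes themselves predicted by the 4D-aware transformer from the input video rather than optimized per scene.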