Aria-NeRF: Multimodal Egocentric View Synthesis

Published: 15 May 2024 (Last Modified: 15 May 2024) · WIHR 2024 · CC BY 4.0
Keywords: Multimodal, NeRF
Abstract: We seek to accelerate research on rich, multimodal scene models trained from egocentric data, based on differentiable volumetric ray-tracing inspired by Neural Radiance Fields (NeRFs). Constructing a NeRF-like model from an egocentric image sequence plays a pivotal role in understanding human behavior and has diverse applications in VR/AR. Such egocentric NeRF-like models may serve as realistic simulations, contributing significantly to the advancement of intelligent agents capable of executing tasks in the real world. The future of egocentric view synthesis may lead to novel environment representations that go beyond today's NeRFs by augmenting visual data with multimodal sensors: IMUs for ego-motion tracking, audio sensors to capture cues about surface texture and human language context, and eye-gaze trackers to infer human attention patterns in the scene. To support the development and evaluation of egocentric multimodal scene modeling, we present a comprehensive multimodal egocentric video dataset. It offers a rich collection of sensory data, featuring RGB images, eye-tracking camera footage, audio recordings from a microphone, atmospheric pressure readings from a barometer, positional coordinates from GPS, connectivity details from Wi-Fi and Bluetooth, and measurements from two IMUs (sampled at 1 kHz and 800 Hz) paired with a magnetometer. The dataset was collected with the Meta Aria Glasses wearable device platform. We evaluated two baseline NeRF-based models, Nerfacto and NeuralDiff, on our dataset. While both produce reasonable visual reconstructions of the scene, our findings also highlight opportunities for further improvement using sensing modalities beyond vision. The diverse data modalities and the real-world context captured within this dataset provide a robust foundation for furthering our understanding of human behavior and enabling more immersive and intelligent experiences in VR, AR, and robotics.
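To make the "differentiable volumetric ray-tracing" mentioned above concrete, the sketch below shows the standard NeRF-style alpha-compositing step along a single ray in NumPy. It is a minimal illustration under the usual NeRF rendering assumptions, not the authors' implementation or the Nerfacto/NeuralDiff code; the function name and toy inputs are hypothetical.

```python
# Minimal sketch of NeRF-style volume rendering along one ray (illustrative only).
import numpy as np

def composite_ray(rgb, sigma, t_vals):
    """Alpha-composite per-sample colors into a single pixel color.

    rgb:    (N, 3) per-sample radiance predicted by the field
    sigma:  (N,)   per-sample volume density
    t_vals: (N,)   sample depths along the ray, in increasing order
    """
    # Distances between consecutive samples; pad the last interval with a large value.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity of each interval: alpha_i = 1 - exp(-sigma_i * delta_i)
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10))
    # Per-sample compositing weights and the expected ray color
    # (the quantity supervised against the observed pixel during training).
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)

# Toy usage with random field outputs for a ray with 64 samples.
N = 64
t_vals = np.linspace(0.1, 4.0, N)
rgb = np.random.rand(N, 3)      # stand-in for the radiance field's color output
sigma = np.random.rand(N) * 5.0  # stand-in for the predicted density
print(composite_ray(rgb, sigma, t_vals))
```

Because every step is differentiable, gradients of a photometric loss on the composited color flow back to the field's density and color predictions; extending the loss with additional sensor streams (e.g., IMU-derived motion or gaze-weighted sampling) is the kind of multimodal direction the abstract points toward.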
Submission Number: 1