AViON4D: Audio-Visual Open-Vocabulary 4D Egocentric Scene Understanding

Published: 14 May 2026, Last Modified: 14 May 2026, Venue: CVPR 2026 Workshop OpenSUN3D Poster, Readers: Everyone, License: CC BY 4.0
Keywords: NeRF, open-vocabulary, egocentric video, action recognition
Abstract: Egocentric videos offer rich, first-person perspectives on human interactions with the world, enabling applications such as activity recognition and assistive technologies. While recent advances in neural radiance fields (NeRFs) and semantic distillation from vision-language models have enabled open-vocabulary 3D scene understanding, these methods are limited to static scenes and cannot encode the temporal semantics and dynamic human-world interactions essential for egocentric spatiotemporal understanding. We introduce AViON4D, the first multimodal NeRF-based method that extends semantic NeRFs from 3D to 4D for egocentric scenes. AViON4D lifts video features into 4D representations, replacing static image embeddings with temporally-aware features that encode action dynamics. To handle the extreme motion characteristic of egocentric video, we design an object-centric feature extraction strategy that maintains spatial and temporal coherence by tracking objects across frames. Furthermore, AViON4D incorporates audio-language features as a complementary temporal signal via late fusion, helping to disambiguate visually similar actions and refine temporal boundaries through distinctive acoustic cues. Extensive experiments on two complementary 4D open-vocabulary tasks demonstrate that AViON4D outperforms existing single-modal and static approaches by up to +6.0% for action localization and +14.90% for action segmentation. Moreover, for the first time, we show that audio cues from egocentric recordings not only enhance performance but also narrow the gap between single-frame and video-based models, offering a more robust and generalizable solution for 4D scene understanding in egocentric settings. Code will be made available upon acceptance.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Proceedings: true
Submission Number: 1