Abstract: The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. A significant challenge in this area is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. To address this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects in ego-centric videos. We present Ego3DT, a novel framework that first identifies and extracts detection and segmentation information for objects in the ego environment. Using information from adjacent video frames, Ego3DT dynamically constructs a 3D scene of the ego view with a pre-trained 3D scene reconstruction model. We further introduce a dynamic hierarchical association mechanism that produces stable 3D tracking trajectories of objects in ego-centric videos. The efficacy of our approach is corroborated by extensive experiments on two newly compiled datasets, yielding improvements of 1.04x - 2.90x in HOTA and showcasing the robustness and accuracy of our method in diverse ego-centric scenarios.
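To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical Python sketch of an Ego3DT-style loop: per-frame zero-shot detection and segmentation, 3D reconstruction from adjacent frames, lifting detections into the reconstructed scene, and a greedy stand-in for the dynamic hierarchical association. Every name and interface here (detect_and_segment, reconstruct_scene, lift_to_3d, hierarchical_associate) is an illustrative assumption; the submission record does not specify implementation details.

```python
"""Hypothetical sketch of an Ego3DT-style pipeline; all interfaces are assumptions."""
from dataclasses import dataclass, field


@dataclass
class Detection:
    label: str             # object category from a zero-shot detector (assumed)
    box: tuple             # 2D bounding box (x1, y1, x2, y2)
    point3d: tuple = None  # 3D position, filled after lifting into the scene


@dataclass
class Track:
    track_id: int
    history: list = field(default_factory=list)  # per-frame (label, 3D point)


def detect_and_segment(frame):
    """Placeholder for zero-shot detection + segmentation on one ego frame."""
    return [Detection(label="cup", box=(10, 10, 50, 50))]


def reconstruct_scene(frames):
    """Placeholder for the pre-trained 3D reconstruction over adjacent frames;
    returns one dummy camera pose per frame."""
    return [{"pose": i} for i, _ in enumerate(frames)]


def lift_to_3d(det, pose):
    """Back-project a 2D detection into the reconstructed scene (stubbed)."""
    cx = (det.box[0] + det.box[2]) / 2
    cy = (det.box[1] + det.box[3]) / 2
    det.point3d = (cx, cy, float(pose["pose"]))  # dummy depth from pose index
    return det


def hierarchical_associate(tracks, detections, next_id):
    """Greedy stand-in for the paper's dynamic hierarchical association:
    match by label first, then by 3D proximity; spawn new tracks otherwise."""
    for det in detections:
        match = min(
            (t for t in tracks if t.history and t.history[-1][0] == det.label),
            key=lambda t: sum((a - b) ** 2
                              for a, b in zip(t.history[-1][1], det.point3d)),
            default=None,
        )
        if match is None:
            match = Track(track_id=next_id)
            next_id += 1
            tracks.append(match)
        match.history.append((det.label, det.point3d))
    return tracks, next_id


def ego3dt_like_pipeline(frames):
    poses = reconstruct_scene(frames)
    tracks, next_id = [], 0
    for frame, pose in zip(frames, poses):
        dets = [lift_to_3d(d, pose) for d in detect_and_segment(frame)]
        tracks, next_id = hierarchical_associate(tracks, dets, next_id)
    return tracks


if __name__ == "__main__":
    # Three dummy frames; the real system would consume ego-centric video.
    print(ego3dt_like_pipeline(frames=[object() for _ in range(3)]))
```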
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This research advances multimedia and multimodal processing by introducing a robust method for 3D object tracking and scene reconstruction in ego-centric videos. It enhances the capability of multimedia systems to recognize and track objects in dynamic, real-world scenarios, which is particularly relevant for applications where context and environmental interaction play crucial roles, such as augmented reality (AR) and virtual reality (VR). The zero-shot approach to dynamic 3D scene reconstruction from ego-centric videos also enables more immersive and interactive multimedia applications by allowing more accurate and realistic rendering of 3D environments from real-world video input.
Supplementary Material: zip
Submission Number: 1478