PARSE-Ego4D: Personal Action Recommendation Suggestions for Egocentric Videos

Published: 06 Mar 2025 · Last Modified: 21 Mar 2025 · ICLR 2025 FM-Wild Workshop · CC BY 4.0
Keywords: foundation models, large language models, augmented reality, virtual reality, human annotation, recommender systems, perception
TL;DR: We present a dataset of rich human annotations for real-world action suggestions in AR/VR systems, built on the Ego4D dataset.
Abstract: In the rapidly evolving landscape of AI-driven assistance, foundation models (FMs) play a crucial role in bridging understanding with actionable insights. Existing egocentric video datasets are rich in annotations but offer no guidance on the actions an intelligent assistant could take. To bridge this gap and improve the real-world user experience of FMs, we introduce PARSE-Ego4D, a novel set of personal action recommendation annotations for the Ego4D dataset. We generated over 18,000 context-aware action suggestions using a large language model and filtered them through 33,000 human annotations to ensure high-quality, user-centered recommendations. We analyze inter-rater agreement and participant preferences to ground our annotations in real-world applicability. PARSE-Ego4D is designed not only to enhance action suggestion systems but also to support the integration of multiple modalities, such as text and video, in practical applications like augmented and virtual reality. We propose new benchmarks that emphasize personalization, adaptability, and efficiency for AI assistants in the wild. Our work highlights the importance of grounding foundation models in the wild to facilitate their responsible deployment across varied domains, and it provides a foundation for researchers and developers working on personalizable AI systems, bridging the gap between FM customization and practical real-world applications.
Submission Number: 74