PARSE-Ego4D: Toward Bidirectionally Aligned Action Recommendations for Egocentric Videos

Published: 06 Mar 2025, Last Modified: 05 May 2025
ICLR 2025 Bi-Align Workshop (Oral)
License: CC BY 4.0
Keywords: foundation models, large language models, augmented reality, virtual reality, human annotation, recommender systems, perception, alignment
TL;DR: We introduce a dataset of 18,000 context-aware action recommendations, validated by 33,000 human annotations, to drive bidirectional alignment in proactive AI assistants for egocentric video applications.
Abstract: Intelligent assistance involves not only understanding but also action. Existing egocentric video datasets contain rich annotations of the videos, but not of the actions an intelligent assistant could take in the moment. Suggesting meaningful, valuable actions to users demands closer alignment of the AI assistant with human preferences than merely understanding user actions does. To address this gap, we release **PARSE-Ego4D**, a novel set of personal action recommendation annotations for the Ego4D dataset. We generated over 18,000 context-aware action suggestions using a large language model and filtered these through 33,000 human annotations for AI-to-human alignment. We propose new benchmarks and emphasize personalization, adaptability, and efficiency for AI assistants in the real world. For AI assistants to be most useful, we argue that continual learning and adaptation of the AI to the human user, and vice versa, are necessary. The annotations in PARSE-Ego4D support researchers and developers working toward always-on, bidirectionally aligned action recommendation systems for augmented and virtual reality.
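To make the human-filtering stage described in the abstract concrete, here is a minimal, purely illustrative sketch of majority-vote filtering of LLM-generated suggestions by human annotations. The field names (`clip_id`, `text`, `ratings`) and the `filter_by_majority` helper are hypothetical placeholders, not the released PARSE-Ego4D schema or tooling.

```python
# Hypothetical sketch of the annotation-filtering stage: keep LLM-generated
# action suggestions only when a majority of human annotators rated them as
# valid. Field names are illustrative, not the PARSE-Ego4D release format.
from dataclasses import dataclass


@dataclass
class Suggestion:
    clip_id: str        # Ego4D clip the suggestion is grounded in
    text: str           # LLM-generated action suggestion
    ratings: list[int]  # per-annotator validity votes (1 = valid, 0 = not)


def filter_by_majority(suggestions: list[Suggestion],
                       threshold: float = 0.5) -> list[Suggestion]:
    """Retain suggestions whose fraction of positive votes exceeds threshold."""
    return [s for s in suggestions
            if s.ratings and sum(s.ratings) / len(s.ratings) > threshold]


if __name__ == "__main__":
    pool = [
        Suggestion("clip_001", "Set a 10-minute timer for the pasta", [1, 1, 0]),
        Suggestion("clip_002", "Order more flour online", [0, 0, 1]),
    ]
    kept = filter_by_majority(pool)
    print([s.text for s in kept])  # -> ['Set a 10-minute timer for the pasta']
```

A per-suggestion majority vote is only one plausible aggregation rule; weighting annotators or using graded ratings would follow the same pattern.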
Submission Type: Long Paper (9 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 57