Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Published: 18 Sept 2025 · Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight) · License: CC BY 4.0
Keywords: egocentric vision, goal inference, wearable agents, benchmarks, egocentric video understanding, multi-modal, egocentric
TL;DR: The first multimodal (vision, audio, digital context, longitudinal) scripted dataset and benchmark for goal inference for wearable assistant agents.
Abstract: There has recently been a surge of interest in Wearable Assistant Agents: agents embodied in a wearable form factor, such as smart glasses, that can take actions toward a user's stated goal, i.e., a high-level, language-expressed command such as "where did I leave my keys?", "Text Alice I will be late", or "What's the weather in Cancun?". In this work, we consider the complementary problem of eliminating the effort required to interact with such an agent by proactively inferring the user's goal from multimodal contextual observations. As vision-language models (VLMs) hold strong potential to ultimately solve this problem, our work focuses on creating a strong benchmark to measure progress toward this end. Given the limited prior work in this area, establishing the benchmark required collecting a novel multimodal goal-inference dataset; our dataset comprises ~30 hours of data from 363 participants across 3,482 recordings, featuring ground-truth reference goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. In a human predictability study, we found that humans set a strong baseline that serves as a de facto upper bound on model performance: they achieve multiple-choice question (MCQ) accuracy of 93%, whereas the best VLM achieves about 84%. However, MCQs assess discrimination, not the model's ultimate task of producing the goal through open-ended text generation. Through a meta-evaluation, we find that a VLM judging the generated goals is as good as a human judge if it has access to a human-authored script of the video or a correct reference goal. Finally, we evaluate several families of modern vision-language models on the benchmark, showing that larger models have a significant performance advantage but are still far from being practically useful, as they produce relevant goals only ~57% of the time. The best-performing smaller models, whose size makes them better suited to wearable applications, perform significantly worse than their larger counterparts, achieving ~49% accuracy on the benchmark. A modality ablation shows that models benefit from extra information in relevant modalities, with minimal degradation from irrelevant ones, but gain less when noisy modalities are included (e.g., digital context in which most of the app state is irrelevant).
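The abstract describes two evaluation modes: MCQ accuracy (discrimination) and judge-scored open-ended goal generation. The sketch below illustrates how such scores could be computed; the record fields, file name, and placeholder judge are hypothetical illustrations under assumed formats, not the actual WAGIBench data schema or evaluation harness.

```python
# Minimal sketch of the two evaluation modes: (1) MCQ accuracy and
# (2) judge-scored open-ended goal generation. Field names and the
# predictions file layout are assumptions for illustration only.
import json
from typing import Callable, Iterable


def mcq_accuracy(records: Iterable[dict]) -> float:
    """Fraction of examples where the model picked the reference option."""
    records = list(records)
    correct = sum(r["predicted_option"] == r["reference_option"] for r in records)
    return correct / max(len(records), 1)


def judged_accuracy(records: Iterable[dict],
                    judge: Callable[[str, str], bool]) -> float:
    """Fraction of generated goals a judge deems equivalent to the reference.

    `judge` stands in for either a human rater or a VLM judge that is given
    the correct reference goal (the setting in which the paper finds the VLM
    judge matches human judgment).
    """
    records = list(records)
    hits = sum(judge(r["generated_goal"], r["reference_goal"]) for r in records)
    return hits / max(len(records), 1)


if __name__ == "__main__":
    with open("predictions.jsonl") as f:  # hypothetical predictions file
        preds = [json.loads(line) for line in f]
    print("MCQ accuracy:", mcq_accuracy(preds))
    # Trivial placeholder judge; in practice this would call a VLM or human.
    exact_match = lambda gen, ref: gen.strip().lower() == ref.strip().lower()
    print("Judged accuracy:", judged_accuracy(preds, exact_match))
```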
Croissant File: zip
Dataset URL: https://dl.fbaipublicfiles.com/wagibench/ob2.tar.gz
Code URL: https://github.com/facebookresearch/WAGIBench
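For convenience, the following is a minimal sketch of fetching and unpacking the dataset tarball from the Dataset URL above; the local paths are arbitrary, and the archive's internal layout should be checked against the WAGIBench repository's documentation rather than assumed from this snippet.

```python
# Sketch: download and extract the WAGIBench dataset tarball (stdlib only).
# Local file and directory names here are arbitrary choices, not part of
# the official instructions.
import tarfile
import urllib.request
from pathlib import Path

URL = "https://dl.fbaipublicfiles.com/wagibench/ob2.tar.gz"
archive = Path("ob2.tar.gz")
out_dir = Path("wagibench_data")

if not archive.exists():
    urllib.request.urlretrieve(URL, archive)  # large download (~30 hours of recordings)

out_dir.mkdir(exist_ok=True)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(out_dir)  # inspect the extracted contents before use

print("Extracted to", out_dir.resolve())
```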
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 2012