WearVQA: A Visual Question Answering Benchmark for Wearables in Authentic Egocentric Real-World Scenarios
Keywords: visual question answering, egocentric understanding, wearable computing, benchmark dataset, Multimodal AI, human-computer interaction, contextual understanding, contextual reasoning, real-world evaluation
TL;DR: WearVQA, the first benchmark specifically designed to evaluate the visual question answering (VQA) capabilities of multi-modal AI assistants on wearable devices.
Abstract: We introduce WearVQA, the first benchmark specifically designed to evaluate the visual question answering (VQA) capabilities of multi-modal AI assistants on wearable devices such as smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,500 carefully curated image-question-answer triplets spanning 7 diverse image domains, including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearable-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieve QA accuracies of only 24-52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advances toward robust, real-world multi-modal AI systems for wearables.
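The abstract mentions an LLM-as-a-judge evaluation framework. The sketch below illustrates the general technique only; the prompt wording, the CORRECT/INCORRECT verdict format, and the `judge` callable are hypothetical and do not reproduce the paper's actual framework.

```python
# Illustrative sketch of a generic LLM-as-a-judge check for VQA answers.
# The prompt template and the `judge` callable are assumptions, not the
# evaluation protocol described in the paper.
from typing import Callable, Iterable, Tuple

JUDGE_PROMPT = (
    "You are grading a visual question answering system.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def judge_answer(question: str, reference: str, prediction: str,
                 judge: Callable[[str], str]) -> bool:
    """Return True if the judge LLM marks the prediction as correct."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction
    )
    verdict = judge(prompt).strip().upper()
    return verdict.startswith("CORRECT")

def qa_accuracy(examples: Iterable[Tuple[str, str]], predictions: Iterable[str],
                judge: Callable[[str], str]) -> float:
    """Fraction of predictions the judge accepts; `examples` holds
    (question, reference_answer) pairs aligned with `predictions`."""
    examples, predictions = list(examples), list(predictions)
    correct = sum(
        judge_answer(q, ref, pred, judge)
        for (q, ref), pred in zip(examples, predictions)
    )
    return correct / max(len(predictions), 1)
```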
Croissant File: json
Dataset URL: https://huggingface.co/datasets/tonyliao-meta/WearVQA
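A minimal sketch of loading the dataset from the Hugging Face Hub with the `datasets` library; the split and column names are not specified on this page, so they are treated as assumptions to be checked against the dataset card.

```python
# Minimal sketch: loading WearVQA from the Hugging Face Hub.
# Split names and column names (e.g. image / question / answer) are
# assumptions; inspect the dataset card for the actual schema.
from datasets import load_dataset

ds = load_dataset("tonyliao-meta/WearVQA")  # returns a DatasetDict of splits
print(ds)  # list available splits and their columns

first_split = next(iter(ds))
sample = ds[first_split][0]   # first example of the first split
print(sample.keys())          # confirm the actual field names
```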
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 2119