HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

10 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY-SA 4.0
Keywords: Hand-Object Interaction, VideoQA, Vision and Language
TL;DR: We introduce a new video QA benchmark for evaluating video models’ ability to recognize fine-grained hand-object interactions.
Abstract: Hand-Object Interaction (HOI) is inherently a dynamic process, involving nuanced spatial coordination, diverse manipulation styles, and influences on interacting objects. However, existing HOI benchmarks tend to emphasize high-level action recognition and hand/object localization while neglecting the fine-grained aspects of hand-object dynamics. We introduce HanDyVQA, a video question-answering benchmark for understanding the fine-grained spatiotemporal dynamics of hand-object interactions. HanDyVQA consists of six question types (Action, Process, Objects, Location, State Change, and Object Parts), totaling 11.7k multiple-choice question-answer pairs and 11k instance segmentations, which require discerning fine-grained action contexts, hand-object movements, and state changes caused by manipulation. We evaluated several video foundation models on our benchmark; even the powerful Qwen2.5-VL-72B reaches only 68.8% average accuracy, and our extensive analyses uncover new challenges in component-level geometric and semantic understanding.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/aist-cvrt/HanDyVQA
Code URL: https://github.com/MasaTate/HanDyVQA
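A minimal sketch (not from the paper or its official code) of loading the dataset from the Hugging Face Hub and iterating over multiple-choice QA pairs; the split name and the field names ("question_type", "question", "choices", "answer") are assumptions, so check the dataset card and the linked repository for the actual schema.

# Minimal sketch, assuming the benchmark can be loaded with the standard `datasets` library.
# Field and split names below are hypothetical placeholders.
from datasets import load_dataset

dataset = load_dataset("aist-cvrt/HanDyVQA")  # a specific config or split name may be required

for example in dataset["test"]:
    # Each example is assumed to pair a video clip with one multiple-choice question.
    print(example["question_type"])  # e.g. Action, Process, Objects, Location, State Change, Object Parts
    print(example["question"], example["choices"], example["answer"])
    break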
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in computer vision
Submission Number: 1309