HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

10 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY-SA 4.0
Keywords: Hand-Object Interaction, VideoQA, Vision and Language
TL;DR: We introduce a new video QA benchmark for evaluating video models’ ability to recognize fine-grained hand-object interactions.
Abstract: Hand-Object Interaction (HOI) is inherently a dynamic process, involving nuanced spatial coordination, diverse manipulation styles, and influences on interacting objects. However, existing HOI benchmarks tend to emphasize high-level action recognition and hand/object localization while neglecting the fine-grained aspects of hand-object dynamics. We introduce HanDyVQA, a video question-answering benchmark for understanding the fine-grained spatiotemporal dynamics of hand-object interactions. HanDyVQA consists of six question types (Action, Process, Objects, Location, State Change, and Object Parts), totaling 11.7k multiple-choice question-answer pairs and 11k instance segmentations, which require discerning fine-grained action contexts, hand-object movements, and state changes caused by manipulation. We evaluated several video foundation models on our benchmark; even the powerful Qwen2.5-VL-72B reaches only 68.8% average accuracy, and our extensive analyses uncover new challenges in component-level geometric and semantic understanding.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/aist-cvrt/HanDyVQA
Code URL: https://github.com/MasaTate/HanDyVQA
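A minimal sketch (not from the paper or its official code) of loading the dataset from the Hugging Face Hub and iterating over multiple-choice QA pairs; the split name and the field names ("question_type", "question", "choices", "answer") are assumptions, so check the dataset card and the linked repository for the actual schema.

# Minimal sketch, assuming the benchmark can be loaded with the standard `datasets` library.
# Field and split names below are hypothetical placeholders.
from datasets import load_dataset

dataset = load_dataset("aist-cvrt/HanDyVQA")  # a specific config or split name may be required

for example in dataset["test"]:
    # Each example is assumed to pair a video clip with one multiple-choice question.
    print(example["question_type"])  # e.g. Action, Process, Objects, Location, State Change, Object Parts
    print(example["question"], example["choices"], example["answer"])
    break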
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in computer vision
Submission Number: 1309