Keywords: Hand-Object Interaction, VideoQA, Vision and Language
TL;DR: We introduce a new video QA benchmark for evaluating video models’ ability to recognize fine-grained hand-object interactions.
Abstract: Hand-Object Interaction (HOI) is inherently a dynamic process, involving nuanced spatial coordination, diverse manipulation styles, and effects on the interacting objects.
However, existing HOI benchmarks tend to emphasize high-level action recognition and hand/object localization while neglecting the fine-grained aspects of hand-object dynamics.
We introduce HanDyVQA, a video question-answering benchmark for understanding the fine-grained spatiotemporal dynamics in hand-object interactions.
HanDyVQA consists of six question types (Action, Process, Objects, Location, State Change, and Object Parts), totaling 11.7k multiple-choice question-answer pairs and 11k instance segmentations. Answering them requires discerning fine-grained action contexts, hand-object movements, and state changes caused by manipulation.
We evaluated several video foundation models on our benchmark and found that even the powerful Qwen2.5-VL-72B reaches only 68.8% average accuracy. Our extensive analyses further uncover new challenges in component-level geometric and semantic understanding.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/aist-cvrt/HanDyVQA
Code URL: https://github.com/MasaTate/HanDyVQA
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in computer vision
Submission Number: 1309
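The benchmark is hosted on the Hugging Face Hub (Dataset URL above). Below is a minimal loading sketch, assuming the repository exposes a configuration loadable with the standard `datasets` library; the actual split and field names are not confirmed here and should be checked against the dataset card.

```python
# Minimal sketch: load HanDyVQA from the Hugging Face Hub.
# Assumes the repo is loadable with the standard `datasets` library;
# inspect the returned object for the actual splits and field names.
from datasets import load_dataset

ds = load_dataset("aist-cvrt/HanDyVQA")   # repo id from the Dataset URL above
print(ds)                                 # list available splits and columns

first_split = next(iter(ds))              # name of the first split (e.g., a test split)
sample = ds[first_split][0]               # first example of that split
print(sample.keys())                      # verify field names before further use
```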