HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

Chen Bao; Jiarui Xu; Xiaolong Wang; Abhinav Gupta; Homanga Bharadhwaj

Published: 10 Sept 2025, Last Modified: 10 Sept 2025

Accepted by TMLR

License: CC BY 4.0

Abstract: How can we predict future interaction trajectories of human hands in a scene given high-level colloquial task specifications in the form of natural language? In this paper, we extend the classic hand trajectory prediction task to several tasks involving explicit and implicit language queries. These tasks require an extensive understanding of daily human activities and the ability to reason about what will happen next given cues from the current scene. We also develop new benchmarks to evaluate the two proposed tasks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP). We solve these tasks by integrating the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM, is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversations. Our experiments show that HandsOnVLM outperforms existing task-specific methods and other VLM baselines on the proposed tasks, and that it can effectively use world knowledge to reason about low-level human hand trajectories based on the provided context. More details can be found at https://www.chenbao.tech/handsonvlm/.

Submission Type: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission:
- We conducted additional experiments on visual backbone fine-tuning, following suggestions from reviewer tiy7, which provided valuable insights into our model's behavior.
- We fixed the spelling and grammatical issues pointed out by reviewer tiy7 to improve the overall presentation quality.
- We expanded our analysis of fast tokens based on reviewer pB3k's feedback, including additional experimental results and discussion.
- Following reviewer EfpJ's suggestions, we improved the clarity and accessibility of the paper by refining explanations and enhancing the overall presentation.
- We added the link to the open-source release to the abstract.

Code: https://github.com/Kami-code/HandsOnVLM-release

Assigned Action Editor: ~Derek_Hoiem1

Submission Number: 4966
