Pointing Gesture Understanding via Visual Prompting and Visual Question Answering for Interactive Robot Navigation

Published: 05 Apr 2024, Last Modified: 30 Apr 2024, VLMNM 2024, License: CC BY 4.0
Keywords: vision language models, gesture understanding, navigation
TL;DR: This paper describes a method of VLM-based gesture understanding using visual prompts for interactive robot navigation.
Abstract: In this paper, we explore a method of visual robot navigation in which Vision Language Models (VLMs) interpret a human's pointing gesture towards a desired direction and the robot moves according to the instruction. In this method, we provide rating scales for Visual Question Answering (VQA) in visual or text prompts to the VLMs to quantify ambiguous pointing gestures. A VLM takes prefix text and an observation image of the pointing human annotated with visual prompts, and outputs a pointing scale value that can be used for robot navigation. We validate two gesture rating scales and three visual cues on a pointing gesture dataset. The results demonstrate the difficulty of reliably accomplishing the targeted tasks and indicate the future direction of our research.
Submission Number: 14
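The pipeline described in the abstract (rating-scale VQA over a visually prompted observation image, followed by mapping the model's answer to a motion command) could look roughly like the sketch below. The prompt wording, the 9-level scale, the field-of-view mapping, and the `query_vlm` helper are illustrative assumptions, not the authors' implementation.

```python
import re

# Hypothetical helper: sends an image plus a text prompt to a VLM and returns
# its free-form answer. Any multimodal chat API could stand in here.
def query_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("replace with a call to your VLM of choice")

# Text prompt defining a discrete rating scale over pointing directions.
# The observation image is assumed to be overlaid with numbered direction
# markers (the visual prompt), so the model only has to name one of them.
RATING_PROMPT = (
    "The image shows a person pointing, with numbered arrows 1-9 overlaid on "
    "the floor; 1 is far left and 9 is far right from the robot's viewpoint. "
    "Which numbered arrow best matches the pointing direction? "
    "Answer with a single number."
)

def pointing_scale_to_heading(scale: int, n_levels: int = 9,
                              fov_deg: float = 120.0) -> float:
    """Map a rating-scale answer to a heading angle in degrees.

    Scale 1 maps to -fov/2 (left edge), the top scale to +fov/2 (right edge).
    """
    frac = (scale - 1) / (n_levels - 1)          # 0.0 .. 1.0 across the scale
    return (frac - 0.5) * fov_deg                # centered on straight ahead

def estimate_heading(image_path: str) -> float:
    """Query the VLM with the rating-scale prompt and convert to a heading."""
    answer = query_vlm(image_path, RATING_PROMPT)
    match = re.search(r"\d+", answer)            # pull the first integer out
    if match is None:
        raise ValueError(f"could not parse a scale value from: {answer!r}")
    scale = max(1, min(9, int(match.group())))   # clamp to the valid range
    return pointing_scale_to_heading(scale)
```

The returned heading could then be handed to a local planner as a goal direction; the abstract's comparison of visual versus text prompts corresponds to whether the scale markers are drawn into the image or only described in `RATING_PROMPT`.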