Keywords: task planning, VLM, datasets, prompting
TL;DR: Investigates the impact of fine-tuning and prompting techniques on the planning ability of the open-source VideoLLaMA VLM, evaluated on the EgoPlan-Bench benchmark
Abstract: Recent works have suggested that language-based foundation models contain commonsense knowledge and are capable of performing basic reasoning, which holds significant promise for task-level planning in robotics. As an example, the recent EgoPlan-Bench benchmark studies egocentric, embodied planning, measured through multiple-choice questions about captioned videos. In this work, we thoroughly examine the benchmark using open-source 7B- and 13B-parameter models and investigate the impact of different sources of training data, as well as prompting strategies that are widely used outside the robotics domain. Our experiments show that (1) in-domain and out-of-domain performance is, unsurprisingly, tied to the overlap between training and evaluation datasets, and (2) surprisingly, prompting strategies that have been effective in other domains fail to significantly improve performance here.
Submission Number: 28