Abstract: Large language models (LLMs) have been widely used as procedural planners, providing step-by-step guidance across applications.
However, in human-assistive scenarios where the environment and users' knowledge constantly change, their ability to identify different step types for generating alternative plans remains under-explored. To fill this gap, we assess whether models can identify steps that are:
(i) sequential, (ii) interchangeable, and (iii) optional in textual instructions. We compare LLMs to two vision-aware models relevant for procedural understanding: a large vision-language model and a heuristic approach that uses video-mined knowledge graphs.
Our results indicate that LLMs struggle to capture the notion of mutual exclusivity between sequential and interchangeable steps.
Furthermore, we report comprehensive analyses highlighting the advantages and limitations of using LLMs as procedural task guides.
While the largest LLM shows expert-level task knowledge, our findings reveal its limitations in several key areas: broad task coverage, robustness to diverse user phrasings, and physical reasoning.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, semantic relationships, knowledge tracing/discovering/inducing, robustness
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 947