Improvisational Reasoning with Vision-Language Models for Grounded Procedural Planning

Published: 19 Sept 2025 · Last Modified: 19 Sept 2025 · NeurIPS 2025 Workshop EWM · CC BY 4.0
Keywords: Improvisational reasoning, Vision-language models (VLMs), Grounded procedural planning, Grounded decision-making
TL;DR: We introduce a framework that enables vision-language models to perform grounded procedural improvisation by generating action-object plans that adapt to missing tools in real-world scenes.
Abstract: Improvisation is a hallmark of human intelligence, particularly in high-stakes domains such as emergency medicine, where ideal tools are often unavailable and practitioners must adapt procedures using what is at hand. While recent vision-language models (VLMs) have demonstrated strong general reasoning and perception abilities, they remain inadequate for grounded procedural adaptation under constraints. In this paper, we introduce ImPlan, an improvisational reasoning framework that augments VLMs with structured planning and transformation-aware substitution. ImPlan generates action-object graphs that adapt procedural goals to the context-specific affordances present in a scene. Experiments on a benchmark of expert-annotated emergency procedures show that ImPlan significantly outperforms direct prompting of both proprietary and open-weight VLMs, even when built on weaker backbone models. ImPlan improves groundedness scores by up to 70.8% and plausibility scores by up to 28.6%, achieving simultaneous gains in visual grounding and logical coherence. ImPlan offers a potentially generalizable path toward grounded decision-making in resource-limited environments.
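To make the core idea concrete: the abstract describes ImPlan only at a high level, and the paper's actual graph representation and substitution procedure are not reproduced here. The Python sketch below is purely illustrative of transformation-aware substitution over an action-object plan; the names (ActionStep, ground_plan, AFFORDANCE_MAP) and the affordance entries are hypothetical, not from the paper.

```python
from dataclasses import dataclass

# Hypothetical affordance table: which scene objects can stand in for an
# ideal tool, and what transformation makes the substitute usable.
AFFORDANCE_MAP = {
    "tourniquet": [("belt", "wrap above the wound and cinch tight"),
                   ("t-shirt", "twist into a band, tighten with a stick")],
    "splint": [("rolled magazine", "roll tightly and secure with tape"),
               ("broom handle", "cut to length, pad with cloth")],
}

@dataclass
class ActionStep:
    """One node of an action-object plan: an action applied to an object."""
    action: str
    obj: str
    transformation: str | None = None  # prep step if the object is a substitute

def ground_plan(plan: list[ActionStep], scene_objects: set[str]) -> list[ActionStep]:
    """Replace missing ideal tools with substitutes visible in the scene."""
    grounded = []
    for step in plan:
        if step.obj in scene_objects:
            grounded.append(step)  # ideal tool is present: keep step as-is
            continue
        # Search the affordance table for a substitute present in the scene.
        for candidate, transform in AFFORDANCE_MAP.get(step.obj, []):
            if candidate in scene_objects:
                grounded.append(ActionStep(step.action, candidate, transform))
                break
        else:
            grounded.append(step)  # no substitute found: flag for replanning

    return grounded

if __name__ == "__main__":
    plan = [ActionStep("apply", "tourniquet"), ActionStep("immobilize with", "splint")]
    scene = {"belt", "rolled magazine", "scissors"}  # objects a VLM detected
    for step in ground_plan(plan, scene):
        prep = f" (prep: {step.transformation})" if step.transformation else ""
        print(f"{step.action} {step.obj}{prep}")
```

In the full framework, the scene-object set would come from VLM perception and the affordance table from model reasoning rather than a fixed lookup; the sketch only shows the shape of the adaptation step that the abstract attributes to ImPlan.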
Submission Number: 75