Keywords: Customized Procedure Planning, multi-modal models, vision-language models
Abstract: Generating customized procedures for task planning in instructional videos poses a unique challenge for vision-language models. In this paper, we introduce Customized Procedure Planning in Instructional Videos, a novel task that focuses on generating a sequence of detailed action steps for task completion based on user requirements and the task's initial visual state. Existing methods often neglect customization and user directions, limiting their real-world applicability. The absence of instructional video datasets with step-level state and video-specific action plan annotations has hindered progress in this domain. To address these challenges, we introduce the Customized Procedure Planner (CPP) framework, a causal, open-vocabulary model that leverages a LlaVA-based approach to predict procedural plans based on a task's initial visual state and user directions. To overcome the data limitation, we employ a weakly-supervised approach, using the strong vision-language model GEMINI and the large language model (LLM) GPT-4 to create detailed video-specific action plans from the benchmark instructional video datasets (COIN, CrossTask), producing pseudo-labels for training. Discussing the limitations of the existing procedure planning evaluation metrics in an open-vocabulary setting, we propose novel automatic LLM-based metrics with few-shot in-context learning to evaluate the customization and planning capabilities of our model, setting a strong baseline. Additionally, we implement an LLM-based objective function to enhance model training for improved customization. Extensive experiments, including human evaluations, demonstrate the effectiveness of our approach, establishing a strong baseline for future research in customized procedure planning.
Supplementary Material: pdf
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12330
Loading