Customized Procedure Planning in Instructional Videos

Ali Zare; Constantine Tsibouris; Xudong Lin; Li Zhang; Ming-Hsuan Yang; Shih-Fu Chang

Customized Procedure Planning in Instructional Videos

Ali Zare, Constantine Tsibouris, Xudong Lin, Li Zhang, Ming-Hsuan Yang, Shih-Fu Chang

27 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Customized Procedure Planning, multi-modal models, vision-language models

Abstract: Generating customized procedures for task planning in instructional videos poses a unique challenge for vision-language models. In this paper, we introduce Customized Procedure Planning in Instructional Videos, a novel task that focuses on generating a sequence of detailed action steps for task completion based on user requirements and the task's initial visual state. Existing methods often neglect customization and user directions, limiting their real-world applicability. The absence of instructional video datasets with step-level state and video-specific action plan annotations has hindered progress in this domain. To address these challenges, we introduce the Customized Procedure Planner (CPP) framework, a causal, open-vocabulary model that leverages a LlaVA-based approach to predict procedural plans based on a task's initial visual state and user directions. To overcome the data limitation, we employ a weakly-supervised approach, using the strong vision-language model GEMINI and the large language model (LLM) GPT-4 to create detailed video-specific action plans from the benchmark instructional video datasets (COIN, CrossTask), producing pseudo-labels for training. Discussing the limitations of the existing procedure planning evaluation metrics in an open-vocabulary setting, we propose novel automatic LLM-based metrics with few-shot in-context learning to evaluate the customization and planning capabilities of our model, setting a strong baseline. Additionally, we implement an LLM-based objective function to enhance model training for improved customization. Extensive experiments, including human evaluations, demonstrate the effectiveness of our approach, establishing a strong baseline for future research in customized procedure planning.

Supplementary Material: pdf

Primary Area: applications to robotics, autonomy, planning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 12330

Loading