Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

ICLR 2025 Conference Submission1751 Authors

19 Sept 2024 (modified: 22 Nov 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: procedure planning, diffusion, U-Net, temporal logic interpolation, action prediction, mask
TL;DR: A masked temporal interpolation predictor for procedure planning in instructional videos
Abstract: We study procedure planning in instructional videos, which involves making goal-directed plans from current visual observations in unstructured, real-life videos. Prior research leverages different forms of supervision to bridge the gap between observed states and unobserved actions. Building on this foundation, we introduce a latent-space temporal logical interpolation module within the diffusion model framework. This module provides intermediate supervision of temporal logical relationships, which was previously unavailable. Specifically, an interpolator guides the intermediate process within the diffusion model, taking the start and end observation features as inputs: an encoder extracts latent features, and an interpolation strategy built on transformer encoder blocks derives the interpolated latent features. Furthermore, to ensure the accuracy of the predicted actions, we apply a masking strategy that constrains the scope of predictions and train with a task-adaptive masked proximity loss. Results on three datasets of varying scales demonstrate that our MTID model achieves state-of-the-art performance on the large majority of key metrics. The code is available at https://anonymous.4open.science/r/MTID-E2E3/README.md.
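The two ideas named in the abstract, interpolating between start and end latent features and masking the action space, can be illustrated with a minimal sketch. This is not the MTID implementation: the abstract describes a learned interpolator built on transformer encoder blocks, whereas the sketch below substitutes fixed linear interpolation, and the names `interpolate_latents`, `mask_action_logits`, and `allowed_actions` are hypothetical, as are all dimensions.

```python
import numpy as np

def interpolate_latents(z_start, z_end, horizon):
    """Blend start and end latents into `horizon` intermediate features.

    Assumption: fixed, evenly spaced weights stand in for the learned
    interpolation described in the abstract.
    """
    # Drop the endpoints 0 and 1 so only strictly intermediate steps remain.
    weights = np.linspace(0.0, 1.0, horizon + 2)[1:-1]  # shape (horizon,)
    return (1.0 - weights)[:, None] * z_start + weights[:, None] * z_end

def mask_action_logits(logits, allowed_actions):
    """Set the logits of every action outside `allowed_actions` to -inf."""
    masked = np.full_like(logits, -np.inf)
    masked[:, allowed_actions] = logits[:, allowed_actions]
    return masked

# Toy data: 8-dimensional latents, 10 candidate actions, a 3-step plan.
rng = np.random.default_rng(0)
z_start, z_end = rng.normal(size=8), rng.normal(size=8)
latents = interpolate_latents(z_start, z_end, horizon=3)  # shape (3, 8)

logits = rng.normal(size=(3, 10))  # stand-in for per-step model outputs
plan = mask_action_logits(logits, allowed_actions=[2, 5, 7]).argmax(axis=1)
```

With the mask applied, every predicted step in `plan` falls inside the allowed action set, which is the sense in which masking constrains the scope of predictions.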
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1751