CookingCLIP: Learning a Contextualized Multimodal Embedding from Instructional Cooking Videos for Zero-shot Recipe Generation

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: CookingCLIP: Learning a Contextualized Multimodal Embedding from Instructional Cooking Videos for Zero-shot Recipe Generation
Abstract: Cooking is one of the oldest and most common human activities in daily life, and instructional cooking videos have become one of the most common data sources for multimodal visual understanding research. Compared to general domains, cooking videos (1) not only exhibit a significantly stronger cross-modal dependency between the speech transcripts and their corresponding visual frames at each individual step, (2) but also exhibit a significantly stronger cross-context dependency between sequential steps along the temporal dimension, making them an ideal domain for contextualized multimodal embedding and semantic understanding. This paper proposes CookingCLIP, which introduces the CLIP (Contrastive Language-Image Pre-training) embedding from the general domain into the specific domain of cooking understanding and makes two adaptations to the original CLIP embedding to better suit cooking understanding problems: (1) from the upstream perspective, we extend the static multimodal CLIP embedding with a temporal dimension to facilitate context-aware semantic understanding; (2) from the downstream perspective, we introduce the concept of zero-shot embedding to sequence-to-sequence dense prediction, enabling CLIP not only to tell "Which" (cross-modal recipe generation), but also "When" (cross-context recipe localization). Experiments conducted on two challenging cooking caption generation benchmarks, YouCook2 and CrossTask, demonstrate the effectiveness of the proposed embedding. The code will be released.
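The abstract names two adaptations: extending the static CLIP embedding with a temporal dimension, and using the resulting embeddings for zero-shot cross-modal ("Which") and cross-context ("When") matching. The following is only a minimal illustrative sketch of how such a temporal contextualization over frozen per-step CLIP features could be wired up; the module name `TemporalCLIPContextualizer`, the choice of a Transformer encoder, and all hyper-parameters are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class TemporalCLIPContextualizer(nn.Module):
    """Illustrative sketch (not the paper's implementation): adds a temporal
    dimension to per-step CLIP embeddings by letting each step attend to the
    other steps of the same video with a small Transformer encoder."""

    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, step_embeddings: torch.Tensor) -> torch.Tensor:
        # step_embeddings: (batch, num_steps, dim), e.g. mean-pooled frozen
        # CLIP image features for the frames of each recipe step.
        contextualized = self.encoder(step_embeddings)
        # L2-normalize so the output stays in the cosine-similarity space
        # used by ordinary CLIP embeddings.
        return contextualized / contextualized.norm(dim=-1, keepdim=True)


def zero_shot_scores(video_steps: torch.Tensor, text_steps: torch.Tensor) -> torch.Tensor:
    # video_steps: (num_video_steps, dim), text_steps: (num_text_steps, dim),
    # both L2-normalized. Rows answer "Which" (best text for a video step);
    # columns answer "When" (best video step for a text step).
    return video_steps @ text_steps.t()
```

Under these assumptions, the contextualized video-step embeddings can be compared against CLIP text embeddings of candidate recipe steps without task-specific fine-tuning, which is the zero-shot setting the abstract refers to.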
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3855