Fusing Vision and Language Models to Generate Sequence of Recipe Images from Steps

Published: 19 Mar 2024, Last Modified: 14 Aug 2024
Venue: Tiny Papers @ ICLR 2024 Archive
License: CC BY 4.0
Keywords: Diffusion Models, Vision-Language Models, Generative Models, Computer Vision
TL;DR: Investigating ways to fuse vision and language model embeddings to generate a sequence of recipe images from recipe steps
Abstract: There has been a lot of work on using generative models to produce text descriptions given an image, demonstrating the power of pretrained large language models. There have also been several works on generating a sequence of text from a sequence of images, highlighting the effectiveness of fusing vision and language models to output text. In this work, we examine the effectiveness of fusing image and language models to generate a sequence of recipe images corresponding to the individual recipe steps. We explore different ways to fuse the textual embeddings derived from each step with the encodings from the image, and empirically determine which performs best. We also assess the relative importance of the image and text encoders.
Submission Number: 164
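
The abstract does not specify how the fusion is implemented. Below is a minimal illustrative sketch in PyTorch of two candidate strategies for fusing a per-step text embedding with an image encoding: concatenation followed by a learned projection, and element-wise addition. The class name `EmbeddingFusion`, the `mode` parameter, and all dimensions are hypothetical; the text/image encoders and the diffusion decoder that would consume the fused conditioning vector are elided.

```python
import torch
import torch.nn as nn


class EmbeddingFusion(nn.Module):
    """Illustrative fusion of a per-step text embedding with an image encoding.

    Hypothetical shapes: text_emb and img_emb are both (batch, dim).
    """

    def __init__(self, dim: int, mode: str = "concat"):
        super().__init__()
        self.mode = mode
        # Concatenation doubles the feature size, so project back to `dim`
        self.proj = nn.Linear(2 * dim, dim) if mode == "concat" else None

    def forward(self, text_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        if self.mode == "concat":
            return self.proj(torch.cat([text_emb, img_emb], dim=-1))
        if self.mode == "add":
            # Parameter-free, but assumes both embedding spaces are aligned
            return text_emb + img_emb
        raise ValueError(f"unknown fusion mode: {self.mode}")


# Usage: the fused embedding would condition a diffusion decoder (not shown).
fusion = EmbeddingFusion(dim=512, mode="concat")
cond = fusion(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 512)
```

As a design trade-off, concatenation retains both modalities' information at the cost of extra projection parameters, while addition is parameter-free but requires the two embedding spaces to already be aligned.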