Keywords: Diffusion Models, Vision-Language Models, Generative Models, Computer Vision
TL;DR: Investigating ways to fuse vision and language model embeddings to generate a sequence of recipe images from recipe steps
Abstract: There has been substantial work on using generative models to produce text
descriptions from an image, demonstrating the power of pretrained large language
models. There has also been work on generating a sequence of text from a
sequence of images, highlighting the effectiveness of fusing vision and language
models to output text. In this work, we examine the effectiveness of fusing image
and language models in the opposite direction: generating a sequence of recipe
images corresponding to the individual steps. We compare different ways to fuse
the textual embeddings derived from each step with the image encodings, and
empirically determine which works best. We also assess the relative importance
of the image and text encoders.
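The abstract leaves the fusion mechanisms unspecified; below is a minimal sketch of two common candidates for combining per-step text embeddings with image encodings, concatenation and cross-attention. The module names, dimensions, and PyTorch implementation are illustrative assumptions, not the authors' actual architecture.

```python
# Two illustrative fusion strategies for per-step text embeddings and image
# encodings. This is a hypothetical sketch, not the paper's method.
import torch
import torch.nn as nn


class ConcatFusion(nn.Module):
    """Fuse a step's pooled text embedding with a pooled image feature
    by concatenation followed by a linear projection."""

    def __init__(self, text_dim: int, image_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(text_dim + image_dim, out_dim)

    def forward(self, text_emb: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim); image_feat: (batch, image_dim)
        return self.proj(torch.cat([text_emb, image_feat], dim=-1))


class CrossAttentionFusion(nn.Module):
    """Let image tokens attend to the step's text token embeddings,
    as in cross-attention conditioning of diffusion models."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, n_img, dim); text_tokens: (batch, n_txt, dim)
        fused, _ = self.attn(query=image_tokens, key=text_tokens, value=text_tokens)
        return image_tokens + fused  # residual connection


if __name__ == "__main__":
    b, n_img, n_txt, d = 2, 64, 16, 512
    img = torch.randn(b, n_img, d)
    txt = torch.randn(b, n_txt, d)
    out = CrossAttentionFusion(d)(img, txt)
    print(out.shape)  # torch.Size([2, 64, 512])
```

Concatenation is cheap but collapses the text to a single vector, while cross-attention preserves token-level alignment between a step's words and image regions; the paper's empirical comparison presumably adjudicates between such trade-offs.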
Submission Number: 164