TL;DR: We present EasyGen, an efficient model that facilitates multimodal understanding and generation by harnessing the strengths of diffusion models and large language models (LLMs).
Abstract: We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models, which predominantly depend on encoders such as CLIP or ImageBind and require large amounts of training data to bridge the gap between modalities, EasyGen leverages a bidirectional conditional diffusion model named BiDiffuser, which promotes more efficient interactions between modalities. EasyGen handles image-to-text generation by integrating BiDiffuser and an LLM via a simple projection layer. Unlike most existing multimodal models, which are limited to generating text responses, EasyGen can also perform text-to-image generation by using the LLM to create textual descriptions, which BiDiffuser then interprets to produce appropriate visual responses. Furthermore, EasyGen can be effortlessly integrated into existing advanced multimodal LLMs such as LLaVA to improve their performance. Extensive quantitative and qualitative experiments demonstrate the effectiveness of EasyGen, whose training can easily be carried out in a lab setting.
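As a rough illustration (not taken from the paper), the "simple projection layer" that bridges BiDiffuser image features and the LLM's embedding space might look like the following minimal PyTorch sketch; the module name, feature dimensions, and number of visual tokens are all assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation): a hypothetical projection
# layer mapping diffusion-model image features into an LLM's embedding space.
# All dimensions and the visual-token count are illustrative assumptions.
import torch
import torch.nn as nn

class ImageToLLMProjection(nn.Module):
    """Projects BiDiffuser-style image features to LLM token embeddings."""
    def __init__(self, diffusion_dim: int = 768, llm_dim: int = 4096, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(diffusion_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, seq_len, diffusion_dim) features from the diffusion model
        projected = self.proj(image_feats)            # (batch, seq_len, llm_dim)
        # Keep a fixed number of "visual tokens" to prepend to the text prompt embeddings
        return projected[:, : self.num_tokens, :]     # (batch, num_tokens, llm_dim)

# Usage: the projected visual tokens would be concatenated with the LLM's
# text embeddings before decoding the textual response.
feats = torch.randn(1, 64, 768)                       # dummy image features
visual_tokens = ImageToLLMProjection()(feats)         # shape (1, 32, 4096)
```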
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English