TL;DR: We present EasyGen, an efficient model that facilitates multimodal understanding and generation by harnessing the strengths of diffusion models and large language models (LLMs).
Abstract: We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models, which predominantly depend on encoders such as CLIP or ImageBind and require large amounts of training data to bridge the gap between modalities, EasyGen leverages a bidirectional conditional diffusion model named BiDiffuser, which promotes more efficient interactions between modalities. EasyGen handles image-to-text generation by integrating BiDiffuser and an LLM via a simple projection layer. Unlike most existing multimodal models, which are limited to generating text responses, EasyGen can also perform text-to-image generation by using the LLM to create textual descriptions, which BiDiffuser then interprets to produce appropriate visual responses. Furthermore, EasyGen can be effortlessly integrated into existing advanced multimodal LLMs such as LLaVA to improve their performance. Extensive quantitative and qualitative experiments demonstrate the effectiveness of EasyGen, whose training can easily be carried out in a lab setting.
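As a rough illustration (not taken from the paper), the "simple projection layer" that bridges BiDiffuser image features and the LLM's embedding space might look like the following minimal PyTorch sketch; the module name, feature dimensions, and number of visual tokens are all assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation): a hypothetical projection
# layer mapping diffusion-model image features into an LLM's embedding space.
# All dimensions and the visual-token count are illustrative assumptions.
import torch
import torch.nn as nn

class ImageToLLMProjection(nn.Module):
    """Projects BiDiffuser-style image features to LLM token embeddings."""
    def __init__(self, diffusion_dim: int = 768, llm_dim: int = 4096, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(diffusion_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, seq_len, diffusion_dim) features from the diffusion model
        projected = self.proj(image_feats)            # (batch, seq_len, llm_dim)
        # Keep a fixed number of "visual tokens" to prepend to the text prompt embeddings
        return projected[:, : self.num_tokens, :]     # (batch, num_tokens, llm_dim)

# Usage: the projected visual tokens would be concatenated with the LLM's
# text embeddings before decoding the textual response.
feats = torch.randn(1, 64, 768)                       # dummy image features
visual_tokens = ImageToLLMProjection()(feats)         # shape (1, 32, 4096)
```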
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English