Intermediate Adapter: Efficient Alignment of Text in Diffusion Models

ACL ARR 2024 June Submission3523 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Diffusion models have been widely used for text-to-image generation. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics in the text, such as object counts and spatial relationships. We approach this problem from an architectural perspective and investigate how the conditioning architecture affects vision-language alignment in diffusion models. We propose a new conditioning architecture, the Intermediate Adapter, to improve text-to-image alignment, generation quality, and training and inference speed for diffusion models. We perform experiments on text-to-image generation on the MS-COCO dataset, applying Intermediate Adapters to two common conditioning methods on a U-ViT backbone. For both end-to-end training and fine-tuning of pretrained diffusion models, our method improves the CLIP Score, FID, and human evaluation results of the generated images, with 20% fewer FLOPs and 25% faster training and inference.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, efficient models, model architectures
Contribution Types: Model analysis & interpretability, Approaches to low-compute settings / efficiency
Languages Studied: English
Submission Number: 3523