Keywords: Image Generation, MLP
Abstract: Although Transformer-based models have achieved significant success in image generation, computing scaled dot-product attention for token interactions incurs substantial computational overhead. To address this, researchers have attempted to optimize the attention matrix directly, treating it as a set of learnable parameters trained by gradient descent. However, an attention matrix learned in this way captures only a global interaction pattern: tokens in every input image interact through the same single learned matrix. Because the distribution, size, and other characteristics of objects vary from image to image, such a matrix is often suboptimal. To overcome this limitation, we propose {\mname}, which introduces two novel components: \textbf{1) MoE-Linear Attention Module:} We design multiple learnable attention matrices and adaptively assign a weight to each matrix for every image; the matrices are then linearly combined to form the final attention matrix. Because the space of possible weight combinations is large, the model can learn a combination better suited to each image; \textbf{2) Multi-Head Module:} We partition the channels into several heads and perform MoE-Linear Attention on each head separately, which substantially increases the diversity of attention-matrix combinations across images. Experiments on the MS-COCO dataset show that our method achieves 7.43 FID, a \textbf{1.19} improvement over traditional MLP-based approaches (8.62 FID), with only negligible additional computational cost.
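To make the two components concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the fixed token count, the mean-pooled gating network, the softmax normalization of the mixed matrix, and names such as `MoELinearAttention`, `num_experts`, and `num_tokens` are all illustrative assumptions.

```python
# Minimal sketch of per-image mixtures of learnable attention matrices
# with a multi-head channel split. Shapes and gating are assumptions.
import torch
import torch.nn as nn


class MoELinearAttention(nn.Module):
    def __init__(self, dim, num_tokens, num_heads=4, num_experts=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.num_experts = num_experts
        # One bank of learnable attention matrices per head: (H, E, N, N).
        self.attn_banks = nn.Parameter(
            torch.randn(num_heads, num_experts, num_tokens, num_tokens) * 0.02
        )
        # Gating network: pooled image features -> per-head weights over experts.
        self.gate = nn.Linear(dim, num_heads * num_experts)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, C) token features, one image per batch element.
        B, N, C = x.shape
        H, E = self.num_heads, self.num_experts
        # Per-image, per-head mixture weights (softmax over the expert axis).
        w = self.gate(x.mean(dim=1)).view(B, H, E).softmax(dim=-1)
        # Linearly combine the learnable attention matrices: (B, H, N, N).
        attn = torch.einsum("bhe,henm->bhnm", w, self.attn_banks)
        attn = attn.softmax(dim=-1)  # row-normalization (an assumption)
        # Split channels into heads, apply each head's mixed matrix, merge back.
        v = x.view(B, N, H, C // H).transpose(1, 2)   # (B, H, N, C/H)
        out = torch.matmul(attn, v)                   # (B, H, N, C/H)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because each image only contributes a small gating vector, the extra cost over a single learned attention matrix is one small linear layer plus a weighted sum of fixed matrices, which is consistent with the "negligible additional computational cost" claim.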
Primary Area: generative models
Submission Number: 314