All are Worth Words: a ViT Backbone for Score-based Diffusion Models

Fan Bao; Chongxuan Li; Yue Cao; Jun Zhu

Published: 29 Nov 2022, Last Modified: 26 May 2025 | SBM 2022 Poster | Readers: Everyone

Keywords: score-based model, diffusion model, vision transformer

TL;DR: Propose a vision transformer backbone for score-based diffusion models

Abstract: Vision transformers (ViT) have shown promise in various vision tasks, including low-level ones, while the U-Net remains dominant in score-based diffusion models. In this paper, we perform a systematic empirical study of ViT-based architectures in diffusion models. Our results suggest that adding extra long skip connections (as in the U-Net) to ViT is crucial for diffusion models. The new ViT architecture, together with other improvements, is referred to as U-ViT. On several popular visual datasets, U-ViT achieves generation results competitive with state-of-the-art (SOTA) U-Nets while requiring a comparable, if not smaller, number of parameters and amount of computation.
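
The long skip connections mentioned in the abstract can be pictured with a small wiring sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `u_shaped_forward`, `shift`, and `add_merge` are hypothetical, the toy blocks stand in for transformer blocks, and merging by element-wise addition is an assumption made for brevity, since the abstract does not say how the skip branches are combined.

```python
from typing import Callable, List, Sequence

Tokens = List[List[float]]          # a sequence of token vectors
Block = Callable[[Tokens], Tokens]  # stand-in for one transformer block


def u_shaped_forward(
    tokens: Tokens,
    in_blocks: Sequence[Block],
    mid_block: Block,
    out_blocks: Sequence[Block],
    merge: Callable[[Tokens, Tokens], Tokens],
) -> Tokens:
    """Run blocks with U-Net-style long skips from shallow to deep layers."""
    if len(in_blocks) != len(out_blocks):
        raise ValueError("each input block needs a matching output block")
    saved: List[Tokens] = []
    for block in in_blocks:
        tokens = block(tokens)
        saved.append(tokens)  # remember the output for the long skip
    tokens = mid_block(tokens)
    for block in out_blocks:
        # The most recently saved output pairs with the first output block,
        # so the shallowest block is connected to the deepest one.
        tokens = block(merge(tokens, saved.pop()))
    return tokens


def shift(delta: float) -> Block:
    """Toy block (assumption): adds a constant to every token entry."""
    return lambda toks: [[x + delta for x in tok] for tok in toks]


def add_merge(main: Tokens, skip: Tokens) -> Tokens:
    """Assumed merge rule: element-wise sum of main and skip branches."""
    return [[a + b for a, b in zip(m, s)] for m, s in zip(main, skip)]


result = u_shaped_forward(
    [[0.0, 0.0]],
    in_blocks=[shift(1.0), shift(2.0)],
    mid_block=shift(10.0),
    out_blocks=[shift(100.0), shift(1000.0)],
    merge=add_merge,
)
print(result)  # [[1117.0, 1117.0]]
```

The stack of saved outputs is what gives the U shape: the last input block feeds the first output block, and the first input block feeds the last. Any other merge rule can be swapped in through the `merge` argument.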

Student Paper: Yes

Community Implementations: [3 code implementations (CatalyzeX)](https://www.catalyzex.com/paper/all-are-worth-words-a-vit-backbone-for-score/code)
