Article · Jun 1, 2023 · Closed access

All are Worth Words: A ViT Backbone for Diffusion Models

Tsinghua University · Renmin University of China · +1 more institution

Indexed in Crossref

Abstract

Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition, and noisy image patches, as tokens and by employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve…
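The two design choices named in the abstract (every input becomes a token; shallow layers connect to deep layers through long skips) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names, the row-major patch order, the assumption that the time and condition tokens already have the same width as a flattened patch, and the assumption of an odd number of blocks with a single middle block are all illustrative choices made here. How the skip branch is merged with the main branch is not stated in the abstract and is left out.

```python
# Illustrative sketch only: names, shapes, and the odd-depth layout are
# assumptions, not taken from the paper's code.

def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch tokens, row-major."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def build_tokens(time_token, cond_token, image, patch):
    """Place time, condition, and noisy image patches in one token sequence.

    Assumes time_token and cond_token already have patch * patch entries,
    standing in for a shared embedding width.
    """
    return [time_token, cond_token] + patchify(image, patch)


def skip_pairs(depth):
    """Pair shallow block i with deep block depth - 1 - i (odd depth assumed)."""
    half = depth // 2
    return [(i, depth - 1 - i) for i in range(half)]


if __name__ == "__main__":
    # Hypothetical 4 x 4 "noisy image" with values 0..15, split into 2 x 2 patches.
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    sequence = build_tokens([0.5] * 4, [1.0] * 4, image, patch=2)
    print(len(sequence))   # 2 extra tokens + 4 patch tokens
    print(skip_pairs(5))   # blocks 0 and 1 feed blocks 4 and 3
```

With five blocks, `skip_pairs(5)` returns `[(0, 4), (1, 3)]`, leaving block 2 as the unpaired middle block; the outermost pair spans the whole network, which is what makes the skips "long".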

Citation impact

  • Total citations: 211
  • FWCI: 23.98
  • Percentile: 100%
  • References: 131

Authors

7

Topics & keywords

Keywords
  • Upsampling
  • Computer science
  • Convolutional neural network
  • Generative model
  • Artificial intelligence
  • Generative grammar
  • Image (mathematics)
  • Deep learning

Funding