All are Worth Words: A ViT Backbone for Diffusion Models
Tsinghua University · Renmin University of China · +1 more institution
Abstract
Vision transformers (ViT) have shown promise in various vision tasks, while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs, including the time, condition, and noisy image patches, as tokens and by employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve…
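The two design points named in the abstract (every input becomes a token, and shallow layers feed deep layers through long skip connections) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names, the toy sizes, the identity stand-in for a transformer block, and the use of addition to merge a skip branch. The abstract does not say how the branches are combined, so this should not be read as the authors' implementation.

```python
# Illustrative sketch only. All names, sizes, and the merge rule are assumptions.

def patchify(image, patch):
    """Split an H x W image (a list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def build_tokens(time_embedding, condition_embedding, noisy_image, patch):
    """Place time, condition, and noisy image patches in one token sequence."""
    return [time_embedding, condition_embedding] + patchify(noisy_image, patch)


def block(tokens):
    """Hypothetical stand-in for a transformer block (identity copy)."""
    return [list(token) for token in tokens]


def merge(deep, shallow):
    """Assumed merge rule for a long skip connection: element-wise addition."""
    return [[d + s for d, s in zip(td, ts)] for td, ts in zip(deep, shallow)]


def forward(tokens, depth):
    """Run `depth` shallow blocks, one middle block, and `depth` deep blocks.

    The output of each shallow block is saved and merged into the input of
    its mirrored deep block (last shallow output goes to the first deep block).
    """
    skips = []
    for _ in range(depth):
        tokens = block(tokens)
        skips.append(tokens)
    tokens = block(tokens)  # middle block, no skip
    for _ in range(depth):
        tokens = block(merge(tokens, skips.pop()))
    return tokens


# Toy inputs: a 4 x 4 "noisy image", 2 x 2 patches, 4-dimensional embeddings.
noisy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
time_embedding = [1.0, 1.0, 1.0, 1.0]
condition_embedding = [2.0, 2.0, 2.0, 2.0]

tokens = build_tokens(time_embedding, condition_embedding, noisy_image, patch=2)
output = forward(tokens, depth=2)
```

In this toy setup the sequence holds one time token, one condition token, and four patch tokens, and the output keeps that same length. A real model would replace `block` with attention and MLP layers and would map the patch tokens back to an image-shaped prediction.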
Citation impact
- FWCI: 23.98
- Percentile: 100%
- References: 131
Authors
- 7
Topics & keywords
- Upsampling
- Computer science
- Convolutional neural network
- Generative model
- Artificial intelligence
- Generative grammar
- Image (mathematics)
- Deep learning