All are Worth Words: A ViT Backbone for Diffusion Models

Bao, Fan; Nie, Shen; Xue, Kaiwen; Cao, Yue; Li, Chongxuan; Su, Hang; Zhu, Jun

doi:10.1109/cvpr52729.2023.02171

Article · Jun 1, 2023 · Closed access

Tsinghua University · Renmin University of China · +1 more institution

Indexed in Crossref

Abstract

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve…
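
The two design choices named in the abstract (every input becomes a token, and shallow layers feed deep layers through long skip connections) can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names are hypothetical, `block` is a stand-in for a transformer block, and merging a skip branch by elementwise addition is an assumption made only to keep the example short.

```python
# Hypothetical sketch of a U-ViT-style token layout and long-skip wiring.
# Assumptions: blocks are stand-ins (no attention), skips are merged by
# addition, and all tokens share the same dimension (patch * patch).

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def build_tokens(time_tok, cond_tok, image, patch):
    """Time, condition and noisy image patches form one token sequence."""
    return [time_tok, cond_tok] + patchify(image, patch)


def block(tokens, shift=1):
    """Stand-in for a transformer block: adds a constant to every entry."""
    return [[v + shift for v in tok] for tok in tokens]


def merge(tokens, skip):
    """Combine the main branch with a long skip branch (assumed: addition)."""
    return [[a + b for a, b in zip(t, s)] for t, s in zip(tokens, skip)]


def u_vit_forward(tokens, depth):
    """Run `depth` shallow blocks, one middle block, then `depth` deep blocks.

    Each shallow output is saved and merged into the mirrored deep block,
    so the first shallow block pairs with the last deep block.
    """
    skips = []
    for _ in range(depth):
        tokens = block(tokens)
        skips.append(tokens)
    tokens = block(tokens)  # middle block, no skip
    for _ in range(depth):
        tokens = merge(tokens, skips.pop())
        tokens = block(tokens)
    return tokens


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "noisy image"
    tokens = build_tokens([0.0] * 4, [1.0] * 4, image, patch=2)
    out = u_vit_forward(tokens, depth=2)
    print(len(tokens), len(out))  # sequence length is preserved end to end
```

The sequence length and token width stay fixed from input to output, which is what lets a shallow block's output be merged directly into its mirrored deep block without any resampling step.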

Citation impact

Total citations: 211

FWCI: 23.98
Percentile: 100%
References: 131

Authors: 7

Topics & keywords

Keywords

Upsampling
Computer science
Convolutional neural network
Generative model
Artificial intelligence
Generative grammar
Image (mathematics)
Deep learning
