PVT v2: Improved baselines with pyramid vision transformer

Wang, Wenhai; Xie, Enze; Li, Xiang; Fan, Deng-Ping; Song, Kaitao; Liang, Ding; Lü, Tong; Luo, Ping; Shao, Ling

doi:10.1007/s41095-022-0274-8

Article · Computational Visual Media · Mar 16, 2022 · Diamond OA

Nanjing University of Science and Technology · Shanghai Artificial Intelligence Laboratory · +5 more institutions

Indexed in: arXiv, Crossref, DataCite, DOAJ

Abstract

Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear-complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin transformer. We hope this work will facilitate state-of-the-art transformer…
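To give a rough sense of what "linear complexity" means for an attention layer, the sketch below counts query-key pairs on a square token grid in two cases: every token attending to every token, and every token attending to a fixed-size pooled set of keys. The function name `attention_pairs` and the 7 x 7 pooled grid (`POOL`) are assumptions made for illustration. Neither is stated in the abstract, and the sketch is not the authors' implementation.

```python
def attention_pairs(height, width, kv_tokens=None):
    """Count query-key pairs for one attention layer on a height x width token grid.

    If kv_tokens is None, every token attends to every token (quadratic cost).
    Otherwise keys/values are assumed to be reduced to a fixed number of tokens.
    """
    queries = height * width
    keys = queries if kv_tokens is None else kv_tokens
    return queries * keys


POOL = 7  # hypothetical side length of the pooled key/value grid

for side in (14, 28, 56):
    full = attention_pairs(side, side)
    pooled = attention_pairs(side, side, kv_tokens=POOL * POOL)
    print(f"grid {side}x{side}: full={full}, pooled={pooled}")
```

Doubling the grid side quadruples the number of tokens. Under these assumptions the full count then grows by a factor of 16, while the pooled count grows by a factor of 4, that is, in proportion to the number of tokens.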

Citation impact

Total citations: 2,145

FWCI: 194.24
Percentile: 100%
References: 66

Authors: 9

Topics & keywords

Keywords

Transformer
Computer science
Segmentation
Embedding
Artificial intelligence
Computation
Computer vision
Computer engineering

Funding

National Natural Science Foundation of China
Awards: 61672273, 61832008, BK20160021