P2T: Pyramid Pooling Transformer for Scene Understanding

Wu, Yu-Huan; Liu, Yun; Zhan, Xin; Cheng, Ming‐Ming

doi:10.1109/tpami.2022.3202765

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Aug 30, 2022 · Closed access

Nankai University · Alibaba Group (China) · +2 more institutions

Indexed in: Crossref, PubMed

Abstract

Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid…
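The cost argument in the abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the feature-map size, channel width, pooling ratios, and the helper names `attention_cost` and `pooled_length` are all hypothetical choices for illustration. It only counts how many key/value tokens remain after pooling and how that changes the approximate cost of one attention layer.

```python
import math
from typing import Sequence


def attention_cost(num_queries: int, num_keys: int, dim: int) -> int:
    """Rough multiply-add count for Q @ K^T plus the weighted sum over V."""
    return 2 * num_queries * num_keys * dim


def pooled_length(height: int, width: int, ratios: Sequence[int]) -> int:
    """Tokens left after pooling an H x W token map once per ratio
    (ceil-rounded output sizes) and concatenating the pooled maps."""
    return sum(math.ceil(height / r) * math.ceil(width / r) for r in ratios)


# Hypothetical setting (an assumption, not a figure from the paper):
# a 56 x 56 token map with 64 channels.
H, W, C = 56, 56, 64
N = H * W

# Full self-attention: every token attends to every token, cost grows as N^2.
full = attention_cost(N, N, C)

# Single pooling: keys/values come from one pooled map (illustrative ratio 8).
single = attention_cost(N, pooled_length(H, W, [8]), C)

# Pyramid pooling: keys/values come from several pooled maps at different
# ratios, concatenated (ratios here are illustrative assumptions).
pyramid = attention_cost(N, pooled_length(H, W, [12, 16, 20, 24]), C)

print("full attention:  ", full)
print("single pooling:  ", single)
print("pyramid pooling: ", pyramid)
```

Under these assumed numbers, both pooled variants keep the key/value sequence to a few dozen tokens, so the attention cost scales with N times a small constant rather than with N squared; the pyramid variant differs from single pooling in that its tokens summarize the map at several scales.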

Citation impact

Total citations: 290

FWCI: 27.21
Percentile: 100%
References: 92

Authors: 4

Topics & keywords

Keywords

Pooling
Transformer
Computer science
Segmentation
Artificial intelligence
Pyramid (geometry)
Computer vision
Motif (music)
