P2T: Pyramid Pooling Transformer for Scene Understanding
Nankai University · Alibaba Group (China) · +2 more institutions
Abstract
Recently, the vision transformer has achieved great success by advancing the state of the art in various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid…
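The cost argument in the abstract can be illustrated with a small sketch. The abstract is cut off before it explains how the paper applies pyramid pooling, so the code below only shows the general idea of pooling a token grid at one scale versus several scales. It is not the authors' implementation. The grid size, the pooling ratios, the scalar features, and all function names are assumptions chosen for illustration.

```python
# Illustrative sketch only: grid size, pooling ratios, and helper names are
# assumptions, not values or APIs taken from the P2T paper.

def avg_pool2d(grid, ratio):
    """Average-pool a 2D grid (list of rows) with a square, non-overlapping window."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(0, h - ratio + 1, ratio):
        row = []
        for j in range(0, w - ratio + 1, ratio):
            window = [grid[i + di][j + dj]
                      for di in range(ratio) for dj in range(ratio)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def flatten(grid):
    """Turn a 2D grid into a 1D token sequence."""
    return [value for row in grid for value in row]


def pyramid_pool_tokens(grid, ratios):
    """Pool the grid at several ratios and concatenate the resulting tokens."""
    tokens = []
    for ratio in ratios:
        tokens.extend(flatten(avg_pool2d(grid, ratio)))
    return tokens


def attention_pairs(num_queries, num_keys):
    """Number of query-key pairs that attention must score."""
    return num_queries * num_keys


H = W = 16  # hypothetical 16x16 token grid, one scalar feature per token
feature_map = [[float(i * W + j) for j in range(W)] for i in range(H)]
n = H * W

single = flatten(avg_pool2d(feature_map, 4))            # one pooling scale
pyramid = pyramid_pool_tokens(feature_map, [4, 8, 16])  # three pooling scales

full_cost = attention_pairs(n, n)                # every token attends to every token
single_cost = attention_pairs(n, len(single))    # keys come from a single pooled map
pyramid_cost = attention_pairs(n, len(pyramid))  # keys come from a pooled pyramid

print(n, len(single), len(pyramid))
print(full_cost, single_cost, pyramid_cost)
```

With these assumed ratios, the pyramid adds only a few coarse tokens to the single-scale sequence, so the number of query-key pairs stays far below the full quadratic count while the key sequence mixes several levels of context.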
Citation impact
- FWCI: 27.21
- Percentile: 100%
- References: 92
Authors
- 4
Topics & keywords
- Pooling
- Transformer
- Computer science
- Segmentation
- Artificial intelligence
- Pyramid (geometry)
- Computer vision
- Motif (music)