P2T: Pyramid Pooling Transformer for Scene Understanding

Nankai University · Alibaba Group (China) · +2 more institutions

Indexed in: Crossref, PubMed

Abstract

Recently, the vision transformer has achieved great success by advancing the state of the art on various vision tasks. One of its most challenging problems is that the long sequence of image tokens leads to high computational cost, since self-attention has quadratic complexity in the sequence length. A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers, in which the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid…
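To make the cost argument concrete, the following minimal sketch counts the query-key score entries of one attention head on a grid of image tokens. It is an illustration under stated assumptions, not the paper's implementation: the helper names, the 56 x 56 grid, the pooling ratios, and the choice to concatenate the pooled tokens from several ratios are all hypothetical.

```python
import math


def pooled_length(h, w, ratio):
    """Tokens left after pooling an h x w token grid by `ratio` per side."""
    return math.ceil(h / ratio) * math.ceil(w / ratio)


def attention_pairs(h, w, pool_ratios=None):
    """Count query-key score entries for one attention head.

    pool_ratios=None : full self-attention, keys are all h*w tokens
    pool_ratios=[r]  : keys/values pooled once (single pooling)
    pool_ratios=[..] : keys/values pooled at several ratios and
                       concatenated (an assumption made for this sketch)
    """
    n_queries = h * w
    if pool_ratios is None:
        n_keys = n_queries
    else:
        n_keys = sum(pooled_length(h, w, r) for r in pool_ratios)
    return n_queries * n_keys


# Hypothetical 56 x 56 token grid (3136 tokens).
full = attention_pairs(56, 56)                 # 3136 * 3136 = 9834496
single = attention_pairs(56, 56, [8])          # 3136 * 49   = 153664
pyramid = attention_pairs(56, 56, [4, 8, 14])  # 3136 * 261  = 818496

print(full, single, pyramid)
```

In this sketch the queries keep full resolution and only the keys and values are shortened, so the cost grows linearly with the pooled length rather than quadratically with the token count. Pooling at several ratios costs more than a single coarse pooling but remains far below full self-attention.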

Funding