MPViT: Multi-Path Vision Transformer for Dense Prediction

Lee, Youngwan; Kim, Jonghee; Willette, Jeffrey; Hwang, Sung Ju

doi:10.1109/cvpr52688.2022.00714

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

Electronics and Telecommunications Research Institute · Korea Advanced Institute of Science and Technology

Indexed in Crossref

Abstract

Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions with varying sizes. While Convolutional Neural Networks (CNNs) have been the dominant architectures for such tasks, recently introduced Vision Transformers (ViTs) aim to replace them as a backbone. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches. In this work, with a different perspective from existing Transformers, we explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds…

Citation impact

Total citations: 342
FWCI: 18.95
Percentile: 100%
References: 86

Authors: 4

Topics & keywords

Keywords

Computer science
Artificial intelligence
Convolutional neural network
Embedding
Segmentation
Transformer
Pattern recognition (psychology)
Encoder

Funding

National Research Foundation of Korea
Award: NRF-2018R1A5A1059921