MPViT: Multi-Path Vision Transformer for Dense Prediction
Electronics and Telecommunications Research Institute · Korea Advanced Institute of Science and Technology
Abstract
Dense computer vision tasks such as object detection and segmentation require effective multi-scale feature representation for detecting or classifying objects or regions with varying sizes. While Convolutional Neural Networks (CNNs) have been the dominant architectures for such tasks, recently introduced Vision Transformers (ViTs) aim to replace them as a backbone. Similar to CNNs, ViTs build a simple multi-stage structure (i.e., fine-to-coarse) for multi-scale representation with single-scale patches. In this work, with a different perspective from existing Transformers, we explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds…
Citation impact
- FWCI: 18.95
- Percentile: 100%
- References: 86
Authors
- Youngwan Lee (corresponding author)
  Electronics and Telecommunications Research Institute, Korea Advanced Institute of Science and Technology
- Jonghee Kim
  Electronics and Telecommunications Research Institute
- Jeffrey Willette
  Korea Advanced Institute of Science and Technology
- Sung Ju Hwang
  Korea Advanced Institute of Science and Technology
Topics & keywords
- Computer science
- Artificial intelligence
- Convolutional neural network
- Embedding
- Segmentation
- Transformer
- Pattern recognition (psychology)
- Encoder