Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
The University of Texas at Austin · Meta (United States) · +1 more institution
Abstract
Vision Transformers (ViTs) have shown superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. However, ViTs are mainly designed for image classification and generate single-scale, low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for them. Therefore, we propose HRViT, which enhances ViTs to learn semantically rich and spatially precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the performance and efficiency of HRViT through various branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs,…
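To make the contrast between single-scale and multi-scale representations concrete, the following minimal sketch compares the token grid of a single-stride ViT with the grids of a multi-branch design. The image size, the stride of 16, and the branch strides (4, 8, 16, 32) are assumptions chosen for illustration and are not taken from the abstract; `token_grid` is a hypothetical helper.

```python
def token_grid(height, width, stride):
    """Return (rows, cols, tokens) of a feature map at a given downsampling stride."""
    rows, cols = height // stride, width // stride
    return rows, cols, rows * cols


# All values below are illustrative assumptions, not figures from the paper.
IMAGE_SIZE = (512, 512)
SINGLE_SCALE_STRIDE = 16               # assumed patch size of a classification-style ViT
MULTI_BRANCH_STRIDES = (4, 8, 16, 32)  # assumed strides of a multi-branch design

# One low-resolution grid for the single-scale model.
single = token_grid(*IMAGE_SIZE, SINGLE_SCALE_STRIDE)

# One grid per branch for the multi-scale model.
multi = {s: token_grid(*IMAGE_SIZE, s) for s in MULTI_BRANCH_STRIDES}

print("single-scale:", single)
for stride, grid in multi.items():
    print(f"stride {stride}:", grid)
```

Under these assumptions, the single-scale model keeps only a 32 x 32 grid, while the finest branch of the multi-branch model keeps a 128 x 128 grid. This is the kind of spatial detail a dense prediction task such as semantic segmentation benefits from, at the cost of many more tokens to process.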
Citation impact
- FWCI: 13.45
- Percentile: 100%
- References: 69
Authors
- 9
Topics & keywords
- Computer science
- Segmentation
- Convolutional neural network
- Artificial intelligence
- Redundancy (engineering)
- Subnetwork
- Transformer
- Block (permutation group theory)
- Sustainable cities and communities