Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

The University of Texas at Austin · Meta (United States) · +1 more institution

Indexed in Crossref

Abstract

Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to the convolutional neural network (CNN)-based models. However, ViTs mainly designed for image classification will generate single-scale low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for ViTs. Therefore, we propose HRViT, which enhances ViTs to learn semantically-rich and spatially-precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the model performance and efficiency of HRViT by various branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs,…

Citation impact

  • Total citations: 238
  • FWCI: 13.45
  • Percentile: 100%
  • References: 69
Authors

9

Topics & keywords

Keywords
  • Computer science
  • Segmentation
  • Convolutional neural network
  • Artificial intelligence
  • Redundancy (engineering)
  • Subnetwork
  • Transformer
  • Block (permutation group theory)

UN Sustainable Development Goals
  • Sustainable cities and communities