Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

Gu, Jiaqi; Kwon, Hyoukjun; Wang, Dilin; Ye, Wei; Li, Meng; Chen, Yu‐Hsin; Lai, Liangzhen; Chandra, Vikas; Pan, David Z.

doi:10.1109/cvpr52688.2022.01178

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

The University of Texas at Austin · Meta (United States) · +1 more institution

Indexed in Crossref

Abstract

Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to the convolutional neural network (CNN)-based models. However, ViTs mainly designed for image classification will generate single-scale low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for ViTs. Therefore, we propose HRViT, which enhances ViTs to learn semantically-rich and spatially-precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the model performance and efficiency of HRViT by various branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs,…

Citation impact

Total citations: 238

FWCI: 13.45
Percentile: 100%
References: 69

Authors: 9

Topics & keywords

Keywords

Computer science
Segmentation
Convolutional neural network
Artificial intelligence
Redundancy (engineering)
Subnetwork
Transformer
Block (permutation group theory)

UN Sustainable Development Goals

Sustainable cities and communities
