EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Liu, Xinyu; Peng, Houwen; Zheng, Ningxin; Yang, Yuqing; Hu, Han; Yuan, Yixuan

doi:10.1109/cvpr52729.2023.01386

Article · Jun 1, 2023 · Closed access

Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu

Chinese University of Hong Kong · Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in multi-head self-attention (MHSA). Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that…
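The sandwich layout described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, swaps in single-head attention for MHSA, and assumes residual connections, ReLU feed-forward layers, random weights, and a hypothetical `n_ffn` parameter for the number of FFN layers on each side. All function names are illustrative.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_ffn(dim, hidden, rng):
    """Two-layer feed-forward network with ReLU (an assumed form)."""
    w1, w2 = rand_matrix(dim, hidden, rng), rand_matrix(hidden, dim, rng)

    def ffn(x):
        h = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
        return matmul(h, w2)

    return ffn


def make_attention(dim, rng):
    """Single-head self-attention, a simplification of MHSA."""
    wq, wk, wv = (rand_matrix(dim, dim, rng) for _ in range(3))

    def attn(x):
        q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
        k_t = [list(col) for col in zip(*k)]  # transpose of k
        scores = [[s / math.sqrt(dim) for s in row] for row in matmul(q, k_t)]
        weights = [softmax(row) for row in scores]
        return matmul(weights, v)

    return attn


def make_sandwich_block(dim, hidden, n_ffn, rng):
    """FFN x n_ffn, then one attention layer, then FFN x n_ffn."""
    pre = [make_ffn(dim, hidden, rng) for _ in range(n_ffn)]
    attn = make_attention(dim, rng)
    post = [make_ffn(dim, hidden, rng) for _ in range(n_ffn)]
    layers = pre + [attn] + post

    def block(x):
        for layer in layers:
            x = add(x, layer(x))  # residual connection (assumed)
        return x

    return block, layers


if __name__ == "__main__":
    rng = random.Random(0)
    block, layers = make_sandwich_block(dim=4, hidden=8, n_ffn=2, rng=rng)
    tokens = rand_matrix(3, 4, rng)  # 3 tokens, 4 channels each
    out = block(tokens)
    print(len(layers), len(out), len(out[0]))
```

The point of the layout is the ratio: only one attention layer per block, surrounded by cheaper feed-forward layers. The sketch covers nothing beyond the sentences quoted above, since the abstract is truncated in this record.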

Citation impact

Total citations: 719

FWCI: 89.36
Percentile: 100%
References: 116

Authors: 6

Topics & keywords

Keywords

Computer science
Computation
Xeon
Parallel computing
Speedup
Redundancy (engineering)
Transformer
Application-specific integrated circuit
