EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
Chinese University of Hong Kong · Microsoft Research (United Kingdom)
Abstract
Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that…
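The sandwich layout described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of plain single-head attention, the ReLU activation, the absence of normalization, the random weights, and the choice of two FFN layers on each side are all assumptions made only to show the ordering of layers (several FFN layers, one attention layer, several FFN layers).

```python
import numpy as np

def ffn(x, w1, w2):
    """Token-wise feed-forward layer with a residual connection."""
    return x + np.maximum(x @ w1, 0.0) @ w2

def self_attention(x, wq, wk, wv, wo):
    """Plain single-head self-attention with a residual connection.
    (Assumption: stands in for the memory-bound MHSA in the abstract.)"""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + (weights @ v) @ wo

def init_params(dim, hidden, n_ffn, rng):
    """Hypothetical helper: small random weights for one sandwich block."""
    def mat(rows, cols):
        return rng.standard_normal((rows, cols)) * 0.02
    def ffn_params():
        return (mat(dim, hidden), mat(hidden, dim))
    return {
        "pre": [ffn_params() for _ in range(n_ffn)],
        "attn": tuple(mat(dim, dim) for _ in range(4)),
        "post": [ffn_params() for _ in range(n_ffn)],
    }

def sandwich_block(x, params):
    """n_ffn FFN layers, then exactly one attention layer, then n_ffn FFN layers."""
    for w1, w2 in params["pre"]:
        x = ffn(x, w1, w2)
    x = self_attention(x, *params["attn"])
    for w1, w2 in params["post"]:
        x = ffn(x, w1, w2)
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))  # 16 tokens, 32 channels (arbitrary sizes)
params = init_params(dim=32, hidden=64, n_ffn=2, rng=rng)
out = sandwich_block(tokens, params)
```

Only one layer in the block builds the token-by-token attention matrix; the remaining layers are per-token matrix multiplications, which is the property the abstract attributes to the sandwich layout.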
Citation impact
- FWCI: 89.36
- Percentile: 100%
- References: 116
Authors
- 6
Topics & keywords
- Computer science
- Computation
- Xeon
- Parallel computing
- Speedup
- Redundancy (engineering)
- Transformer
- Application-specific integrated circuit