RMT: Retentive Networks Meet Vision Transformers
Institute of Automation, Chinese Academy of Sciences · University of Science and Technology Beijing
Abstract
Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend the RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, an…
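The spatial decay idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the default `gamma`, and the decay form `gamma ** distance` (modeled on RetNet's exponential temporal decay) are assumptions made for illustration, as is the choice to multiply the decay into softmax attention weights elementwise.

```python
import numpy as np

def manhattan_decay_matrix(height, width, gamma=0.9):
    """Pairwise decay between tokens on a height x width grid.

    Assumption: decay = gamma ** (Manhattan distance), by analogy with
    RetNet's exponential temporal decay. The abstract only states that
    the matrix is based on the Manhattan distance.
    """
    # Token coordinates, flattened in row-major order.
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (H*W, 2)
    # |y_n - y_m| + |x_n - x_m| for every token pair (n, m).
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)
    return gamma ** dist  # (H*W, H*W), ones on the diagonal

def decayed_attention(q, k, v, decay):
    """Softmax attention whose weights are damped by the decay matrix.

    Hypothetical combination rule: elementwise product of the softmax
    weights with the decay matrix, without renormalization.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights * decay) @ v

# Usage: a 2 x 3 token grid with 4-dimensional features.
rng = np.random.default_rng(0)
decay = manhattan_decay_matrix(2, 3, gamma=0.5)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
out = decayed_attention(q, k, v, decay)
```

Under these assumptions, distant tokens contribute less than nearby ones regardless of content, which is one way to read the "explicit spatial prior" the abstract describes. The sketch still builds the full token-by-token matrix, so it does not address the quadratic cost the abstract also mentions.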
Citation impact
- FWCI: 42.12
- Percentile: 100%
- References: 109
Authors (5)
- Qihang Fan (corresponding author)
Chinese Academy of Sciences, Institute of Automation
- Huaibo Huang
Institute of Automation, Chinese Academy of Sciences
- Mingrui Chen
Institute of Automation, Chinese Academy of Sciences
- Hongmin Liu
University of Science and Technology Beijing
- Ran He
Institute of Automation, Chinese Academy of Sciences
Topics & keywords
- Computer science
- Transformer
- Electrical engineering
- Engineering
- Voltage