RMT: Retentive Networks Meet Vision Transformers

Fan, Qihang; Huang, Huaibo; Chen, Mingrui; Liu, Hongmin; He, Ran

doi:10.1109/cvpr52733.2024.00539

Article · Jun 16, 2024 · Closed access

Chinese Academy of Sciences · Institute of Automation · +1 more institution

Indexed in Crossref

Abstract

Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend the RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, an…
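
The spatial decay idea in the abstract can be illustrated with a short sketch. The abstract only states that the matrix is based on the Manhattan distance between positions; the specific form used below (a scalar `gamma` raised to that distance), the way the mask is applied to the attention weights, and the function names `manhattan_decay_mask` and `decayed_attention` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def manhattan_decay_mask(height, width, gamma=0.9):
    """Return an (H*W, H*W) matrix whose entry (n, m) is
    gamma ** (|y_n - y_m| + |x_n - x_m|) for tokens on an H x W grid.

    Assumption: a single scalar decay rate `gamma` in (0, 1).
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (N, 2) grid positions
    # Pairwise Manhattan distance between all token positions, shape (N, N).
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)
    return gamma ** dist


def decayed_attention(q, k, v, mask):
    """Softmax attention whose weights are scaled elementwise by `mask`.

    Assumption: the decay is applied after the softmax; this is one
    plausible placement, chosen here only to keep the sketch short.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights * mask) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, dim = 4, 4, 8
    mask = manhattan_decay_mask(h, w, gamma=0.9)
    q, k, v = (rng.standard_normal((h * w, dim)) for _ in range(3))
    out = decayed_attention(q, k, v, mask)
    print(mask.shape, out.shape)
```

Nearby tokens keep weights close to their original value while distant tokens are attenuated, which is what gives attention an explicit spatial prior. This dense version is still quadratic in the number of tokens, so it illustrates only the prior, not any reduction in cost.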

Citation impact

Total citations: 186

FWCI: 42.12
Percentile: 100%
References: 109

Authors: 5

Topics & keywords

Keywords

Computer science
Transformer
Electrical engineering
Engineering
Voltage
