Article · Jun 16, 2024 · Closed access

RMT: Retentive Networks Meet Vision Transformers

Chinese Academy of Sciences · Institute of Automation · +1 more institution

Indexed in Crossref

Abstract

Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend the RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, an…
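As a rough illustration of the spatial decay matrix mentioned in the abstract, the sketch below builds a matrix whose entry for two grid positions is a decay rate raised to their Manhattan distance. The function name `manhattan_decay_matrix`, the row-major token ordering, and the default `gamma` value are assumptions made for this example; the abstract as shown here does not give the exact formulation or say how the matrix is combined with the attention scores.

```python
def manhattan_decay_matrix(height, width, gamma=0.9):
    """Build an (H*W) x (H*W) decay matrix for an H x W grid of tokens.

    Entry [n][m] is gamma ** (|r_n - r_m| + |c_n - c_m|), so tokens that are
    far apart on the grid receive a smaller weight. The name, signature, and
    default gamma are illustrative assumptions, not taken from the paper.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")

    # Assumed row-major flattening: token index n -> (row, col) on the grid.
    coords = [(r, c) for r in range(height) for c in range(width)]

    return [
        [gamma ** (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


if __name__ == "__main__":
    # Print the decay matrix for a tiny 2 x 2 grid, one row per line.
    for row in manhattan_decay_matrix(2, 2, gamma=0.5):
        print(row)
```

Each row of the result has a 1 on the diagonal and values that shrink as grid distance grows, which is one way to read the "explicit spatial prior" described above.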


Funding