DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Jiao, Jiayu; Tang, Yu-Ming; Lin, Kun-Yu; Gao, Yipeng; Andy, J.; Wang, Yaowei; Zheng, Wei‐Shi

doi:10.1109/tmm.2023.3243616

Article · IEEE Transactions on Multimedia · Jan 1, 2023 · Closed access

Sun Yat-sen University · Peng Cheng Laboratory

Indexed in Crossref

Abstract

As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches while the global attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches in small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit the performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field. By analyzing the patch…
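The cost trade-off described in the abstract can be made concrete with a rough count of query-key pairs. The sketch below is illustrative only: the function names, the 56 × 56 patch grid, and the 3 × 3 window are assumptions chosen for the example, not values taken from the paper. The `dilation` parameter is likewise an assumption suggested by the paper's title, since the truncated abstract does not describe the mechanism.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every patch attends to every patch (quadratic in patch count)."""
    n = height * width
    return n * n


def local_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when each patch attends to a window x window neighborhood.

    Border effects are ignored, so this is an upper bound (linear in patch count).
    """
    n = height * width
    return n * window * window


def attended_span(window: int, dilation: int = 1) -> int:
    """Side length, in patches, covered by a window whose keys are spaced `dilation` apart.

    With dilation=1 this is an ordinary dense window; larger values widen the
    covered span without adding keys. (Hypothetical helper for illustration.)
    """
    return (window - 1) * dilation + 1


if __name__ == "__main__":
    h = w = 56  # assumed patch grid, for illustration only
    print("global pairs:", global_attention_pairs(h, w))
    print("local pairs (3x3):", local_attention_pairs(h, w, 3))
    print("span, dense 3x3:", attended_span(3))
    print("span, 3x3 with dilation 3:", attended_span(3, dilation=3))
```

Under these assumptions, the local variant evaluates far fewer pairs than the global one, but each patch sees only its immediate neighborhood; spacing the keys apart enlarges the covered span at the same pair count.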

Citation impact

Total citations: 332

FWCI: 37.77
Percentile: 100%
References: 125

Authors: 7

Topics & keywords

Keywords

Computer science
Transformer
Artificial intelligence
Redundancy (engineering)
Exploit
Pattern recognition (psychology)
Theoretical computer science
Computer vision

UN Sustainable Development Goals

Sustainable cities and communities
