Abstract

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage…
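The speed side of the trade-off described in the abstract can be illustrated with simple counting. The sketch below compares the number of query-key pairs under global self-attention with the number under attention restricted to non-overlapping 3D windows. It is a rough illustration, not the paper's implementation: the token grid, the window size, and the function names are hypothetical, and it ignores attention heads, channel width, and any window shifting.

```python
def attention_pairs_global(t, h, w):
    """Query-key pairs when every token attends to every other token."""
    n = t * h * w
    return n * n


def attention_pairs_windowed(t, h, w, wt, wh, ww):
    """Query-key pairs when attention is limited to non-overlapping
    3D windows of size (wt, wh, ww).

    Assumption: each grid dimension is divisible by its window dimension.
    """
    assert t % wt == 0 and h % wh == 0 and w % ww == 0
    num_windows = (t // wt) * (h // wh) * (w // ww)
    tokens_per_window = wt * wh * ww
    return num_windows * tokens_per_window ** 2


# Hypothetical token grid (frames, height, width) and window size,
# chosen only for illustration.
T, H, W = 16, 56, 56
window = (8, 7, 7)

global_pairs = attention_pairs_global(T, H, W)
local_pairs = attention_pairs_windowed(T, H, W, *window)

# With evenly dividing windows, the reduction factor equals the number of windows.
print(global_pairs // local_pairs)  # 128
```

Under these assumptions the global cost grows quadratically with the number of tokens, while the windowed cost grows linearly once the window size is fixed.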

Citation impact

Total citations: 1,875
FWCI: 103.32
Percentile: 100%
References: 53

Authors

7

Topics & keywords

Keywords
  • Computer science
  • Locality
  • Transformer
  • Artificial intelligence
  • Leverage (statistics)
  • Action recognition
  • Architecture
  • Computer vision
UN Sustainable Development Goals
  • Sustainable cities and communities