Video Swin Transformer
University of Science and Technology of China · Microsoft Research Asia (China) · +2 more institutions
Abstract
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off than previous approaches that compute self-attention globally, even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage…
Citation impact
- FWCI: 103.32
- Percentile: 100%
- References: 53
Authors (7)
- Ze Liu (corresponding author)
University of Science and Technology of China, Microsoft Research Asia (China)
- Ning Jia
Microsoft Research Asia (China), Huazhong University of Science and Technology
- Yue Cao
Microsoft Research Asia (China)
- Yixuan Wei
Microsoft Research Asia (China), Tsinghua University
- Zheng Zhang
Microsoft Research Asia (China)
Topics & keywords
- Computer science
- Locality
- Transformer
- Artificial intelligence
- Leverage (statistics)
- Action recognition
- Architecture
- Computer vision
- Sustainable cities and communities