Video Swin Transformer

Liu, Ze; Jia, Ning; Cao, Yue; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Hu, Han

doi:10.1109/cvpr52688.2022.00320

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

University of Science and Technology of China · Microsoft Research Asia (China) · +2 more institutions

Indexed in Crossref

Abstract

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage…

Citation impact

Total citations: 1,875
FWCI: 103.32
Percentile: 100%
References: 53

Authors: 7

Topics & keywords

Keywords

Computer science
Locality
Transformer
Artificial intelligence
Leverage (statistics)
Action recognition
Architecture
Computer vision

UN Sustainable Development Goals

Sustainable cities and communities