Multiview Transformers for Video Recognition

Michigan State University · Google (United States) · +1 more institution

Indexed in Crossref

Abstract

Video understanding requires reasoning at multiple spatiotemporal resolutions – from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state of the art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range…
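
The abstract's core idea (one encoder per view, plus lateral connections that pass information between views) can be illustrated with a toy sketch. This is not the paper's implementation: how a "view" is built (here, averaging groups of consecutive frames), the stand-in "encoders" (fixed scalings), the fusion rule (adding the mean token of the other view), the fusion direction, and all function names are assumptions chosen only to make the data flow concrete.

```python
# Toy sketch only. The view construction, encoders, and fusion rule below are
# illustrative assumptions, not the method described in the paper.

def make_view(frames, stride):
    """Hypothetical 'view': average each group of `stride` consecutive frames into one token."""
    dim = len(frames[0])
    tokens = []
    for start in range(0, len(frames) - stride + 1, stride):
        chunk = frames[start:start + stride]
        tokens.append([sum(f[d] for f in chunk) / stride for d in range(dim)])
    return tokens

def encode(tokens, scale):
    """Stand-in for a per-view encoder: a fixed elementwise scaling."""
    return [[scale * x for x in tok] for tok in tokens]

def mean_token(tokens):
    """Average all tokens of a view into a single vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]

def lateral_fuse(target, source):
    """Assumed lateral connection: add the mean source token to every target token."""
    m = mean_token(source)
    return [[x + y for x, y in zip(tok, m)] for tok in target]

def multiview_forward(frames, fine_stride=1, coarse_stride=4):
    """Encode two views separately, fuse fine into coarse, then pool and concatenate."""
    fine = encode(make_view(frames, fine_stride), scale=1.0)
    coarse = encode(make_view(frames, coarse_stride), scale=0.5)
    coarse = lateral_fuse(coarse, fine)  # fusion direction is an assumption
    return mean_token(fine) + mean_token(coarse)

# Eight toy "frames", each a 2-dimensional feature vector.
frames = [[float(t), 1.0] for t in range(8)]
features = multiview_forward(frames)
```

With eight input frames, the fine view keeps eight tokens and the coarse view has two, so the two stand-in encoders operate on sequences of different lengths before their pooled outputs are concatenated.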

Citation impact

  • Total citations: 283
  • FWCI: 16.00
  • Percentile: 100%
  • References: 118

Authors

7

Topics & keywords

Keywords
  • Computer science
  • Encoder
  • Transformer
  • Artificial intelligence
  • Fuse (electrical)
  • Computer vision
  • Engineering
UN Sustainable Development Goals
  • Sustainable cities and communities