SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated…

Citation impact

Total citations: 267
FWCI: 14.81
Percentile: 100%
References: 99

Authors

8

Topics & keywords

Keywords
  • Closed captioning
  • Computer science
  • Artificial intelligence
  • Transformer
  • Computer vision
  • Redundancy (engineering)
  • Image (mathematics)
UN Sustainable Development Goals
  • Quality Education