SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Microsoft Research (United Kingdom)
Abstract
The canonical approach to video captioning requires a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaptation to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated…
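To make the idea of feeding "video frame patches directly as inputs" concrete, the following is a minimal sketch of how the number of patch tokens grows with clip length when a video is cut into non-overlapping 3D patches. The function name `count_patch_tokens`, the default patch size of (2, 4, 4), and the 224x224 frame size are assumptions chosen for illustration; they are not taken from the abstract.

```python
from math import ceil


def count_patch_tokens(num_frames, height, width, patch=(2, 4, 4)):
    """Count the non-overlapping 3D patches (tokens) that cover a clip.

    `patch` is (frames, pixel rows, pixel columns) per patch. The default
    is an assumed value for illustration only. A clip whose size is not a
    multiple of the patch size is treated as padded up to a whole patch.
    """
    pt, ph, pw = patch
    return ceil(num_frames / pt) * ceil(height / ph) * ceil(width / pw)


if __name__ == "__main__":
    # The same function handles clips of different lengths: only the
    # token count changes, not the patching scheme.
    for frames in (16, 32, 64):
        print(frames, "frames ->", count_patch_tokens(frames, 224, 224), "tokens")
```

Under these assumptions, doubling the number of sampled frames doubles the token count. This is one way to see why a model that consumes raw patches can accept variable-length input, and why the cost of attending over those tokens becomes a concern for long clips.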
Citation impact
- FWCI: 14.81
- Percentile: 100%
- References: 99
Authors
- 8
Topics & keywords
- Closed captioning
- Computer science
- Artificial intelligence
- Transformer
- Computer vision
- Redundancy (engineering)
- Image (mathematics)
- Quality Education