SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Lin, Kevin; Li, Linjie; Lin, Chung-Ching; Ahmed, Faisal; Gan, Zhe; Liu, Zicheng; Lu, Yumao; Wang, Lijuan

doi:10.1109/cvpr52688.2022.01742

Article; 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Jun 1, 2022; Closed access

Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

The canonical approach to video captioning dictates a caption generation model to learn from offline-extracted dense video features. These feature extractors usually operate on video frames sampled at a fixed frame rate and are often trained on image/video understanding tasks, without adaption to video captioning data. In this work, we present SwinBERT, an end-to-end transformer-based model for video captioning, which takes video frame patches directly as inputs, and outputs a natural language description. Instead of leveraging multiple 2D/3D feature extractors, our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input without dedicated…
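As a rough illustration of what "takes video frame patches directly as inputs" can mean in practice, the sketch below splits a toy grayscale video into non-overlapping 3D patches and flattens each one into a token. The function name `video_to_patch_tokens`, the nested-list video layout, and the patch size (2 frames by 4 by 4 pixels) are assumptions chosen for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def video_to_patch_tokens(frames, patch_t, patch_h, patch_w):
    """Split a video into flattened, non-overlapping 3D patches.

    `frames` is a nested list indexed as frames[t][y][x] (one channel).
    All names and the layout are hypothetical, for illustration only.
    """
    num_t, num_h, num_w = len(frames), len(frames[0]), len(frames[0][0])
    if num_t % patch_t or num_h % patch_h or num_w % patch_w:
        raise ValueError("video dimensions must be divisible by the patch size")

    tokens = []
    # Walk over the top-left-front corner of every patch.
    for t0 in range(0, num_t, patch_t):
        for y0 in range(0, num_h, patch_h):
            for x0 in range(0, num_w, patch_w):
                # Flatten one patch_t x patch_h x patch_w block into a list.
                token = [
                    frames[t][y][x]
                    for t in range(t0, t0 + patch_t)
                    for y in range(y0, y0 + patch_h)
                    for x in range(x0, x0 + patch_w)
                ]
                tokens.append(token)
    return tokens


# Toy video: 4 frames of 8x8 pixels, each pixel holding its frame index.
video = [[[t for _ in range(8)] for _ in range(8)] for t in range(4)]

# Assumed patch size: 2 frames x 4 x 4 pixels.
tokens = video_to_patch_tokens(video, patch_t=2, patch_h=4, patch_w=4)
print(len(tokens), len(tokens[0]))
```

A full model would typically map each flattened patch to an embedding vector before any transformer layers; that projection is left out of this sketch.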

Citation impact

Total citations: 267

FWCI: 14.81
Percentile: 100%
References: 99

Authors: 8

Topics & keywords

Keywords

Closed captioning
Computer science
Artificial intelligence
Transformer
Computer vision
Redundancy (engineering)
Image (mathematics)

UN Sustainable Development Goals

Quality Education
