Article, Jun 1, 2016, Closed access

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

Microsoft Research Asia (China)

Indexed in Crossref

Abstract

While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR-Video to Text") which is a new…

Citation impact

Total citations: 1,735
FWCI: 49.67
Percentile: 100%
References: 63

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Automatic summarization
  • Closed captioning
  • Bridging (networking)
  • Vocabulary
  • Sentence
  • Benchmark (surveying)
  • Task (project management)
UN Sustainable Development Goals
  • Quality Education