Article · arXiv (Cornell University) · Jun 7, 2019 · Green OA

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Centre National de la Recherche Scientifique · Institut national de recherche en sciences et technologies du numérique · +4 more institutions

Indexed in arXiv

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and…
