HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Miech, Antoine; Zhukov, Dimitri; Alayrac, Jean-Baptiste; Tapaswi, Makarand; Laptev, Ivan; Šivic, Josef

doi:10.48550/arxiv.1906.03327

article · arXiv (Cornell University) · Jun 7, 2019 · GREEN OA

Antoine Miech · Dimitri Zhukov · Jean-Baptiste Alayrac · Makarand Tapaswi · Ivan Laptev

Centre National de la Recherche Scientifique · Institut national de recherche en sciences et technologies du numérique · +4 more institutions

Indexed in arXiv

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and…

Citation impact

902 total citations

FWCI: —
Percentile: —
References: 68

Authors

6

Topics & keywords

Keywords

Computer science
CLIPS
Embedding
Annotation
Scalability
Information retrieval
Artificial intelligence
Code (set theory)

UN Sustainable Development Goals

Quality Education
