HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Centre National de la Recherche Scientifique · Institut national de recherche en sciences et technologies du numérique · +4 more institutions
Abstract
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and…
Citation impact
- FWCI: not available
- Percentile: not available
- References: 68
Authors (6)
- Antoine Miech (corresponding author)
Centre National de la Recherche Scientifique, Institut national de recherche en sciences et technologies du numérique, Université Paris Sciences et Lettres, École Normale Supérieure - PSL
- Dimitri Zhukov
Centre National de la Recherche Scientifique, Institut national de recherche en sciences et technologies du numérique, Université Paris Sciences et Lettres, École Normale Supérieure - PSL
- Jean-Baptiste Alayrac
Institut national de recherche en sciences et technologies du numérique
- Makarand Tapaswi
Institut national de recherche en sciences et technologies du numérique
- Ivan Laptev
Centre National de la Recherche Scientifique, Institut national de recherche en sciences et technologies du numérique, Université Paris Sciences et Lettres, École Normale Supérieure - PSL
Topics & keywords
- Computer science
- CLIPS
- Embedding
- Annotation
- Scalability
- Information retrieval
- Artificial intelligence
- Code (set theory)
- Quality Education