End-to-End Learning of Visual Representations From Uncurated Instructional Videos
Université Paris Sciences et Lettres · Institut national de recherche en informatique et en automatique · +6 more institutions
Abstract
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval…
Citation impact
- FWCI: 43.06
- Percentile: 100%
- References: 126
Authors (6)
- Antoine Miech (corresponding author)
Université Paris Sciences et Lettres, Institut national de recherche en informatique et en automatique, Centre National de la Recherche Scientifique, École Normale Supérieure - PSL
- Jean-Baptiste Alayrac
DeepMind (United Kingdom)
- Lucas Smaira
DeepMind (United Kingdom)
- Ivan Laptev
Institut national de recherche en informatique et en automatique
- Josef Šivic
Czech Technical University in Prague, Institute of Informatics of the Slovak Academy of Sciences, Institut national de recherche en informatique et en automatique
Topics & keywords
- Computer science
- Action recognition
- Annotation
- Scalability
- Segmentation
- Artificial intelligence
- Action (physics)
- Scratch
- Quality Education