DINOv2: Learning Robust Visual Features without Supervision

Oquab, Maxime; Darcet, Timothée; Moutakanni, Théo; Vo, Huy; Szafraniec, Marc; Khalidov, Vasil; Fernandez, Pierre; Haziza, Daniel; Massa, Francisco; El-Nouby, Alaaeldin; Assran, Mahmoud; Ballas, Nicolas; Galuba, Wojciech; Howes, Russell; Huang, Po-Yao; Li, Shang-Wen; Misra, Ishan; Rabbat, Michael; Sharma, Vasu; Synnaeve, Gabriel; Xu, Hu; Jégou, Hervé; Mairal, Julien; Labatut, Patrick; Joulin, Armand; Bojanowski, Piotr

doi:10.48550/arxiv.2304.07193

Preprint, arXiv (Cornell University), Apr 14, 2023, Green OA

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec

Institut national de recherche en informatique et en automatique

Indexed in: arXiv, DataCite

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and…

Citation impact

Total citations: 1,029

FWCI: —
Percentile: —
References: 131

Authors: 26

Topics & keywords

Keywords

Computer science
Pipeline (software)
Artificial intelligence
Machine learning
Scale (ratio)
Training set
Image (mathematics)

UN Sustainable Development Goals

Quality Education

Funding

Agence Nationale de la Recherche
Awards: ANR-19-P3IA-0003, 19-P3IA-0003, ANR-19