EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Huazhong University of Science and Technology · Beijing Academy of Artificial Intelligence · +2 more institutions
Abstract
We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that…
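The pretext task described in the abstract can be illustrated with a small, framework-free sketch. It is not the authors' implementation. It assumes that the loss is the negative cosine similarity between predicted and target features, averaged over masked patches only, and that patches are masked uniformly at random with a 0.4 ratio. Neither choice is stated in the abstract. The names `sample_patch_mask` and `masked_feature_loss` are hypothetical, and the random vectors stand in for features from a frozen image-text aligned encoder.

```python
import math
import random


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def sample_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with round(num_patches * mask_ratio) True entries.

    Uniform random masking is an assumption made for this sketch.
    """
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity over masked patches only.

    predicted: per-patch features produced by the model being pre-trained.
    target:    per-patch features from a frozen image-text aligned encoder.
    mask:      True where the patch was hidden from the model.
    """
    losses = [
        -cosine_similarity(p, t)
        for p, t, m in zip(predicted, target, mask)
        if m
    ]
    if not losses:
        raise ValueError("mask selects no patches")
    return sum(losses) / len(losses)


rng = random.Random(0)
num_patches, dim = 16, 8

# Stand-in target features; a real setup would take these from a frozen encoder.
target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
mask = sample_patch_mask(num_patches, mask_ratio=0.4, rng=rng)

# A prediction identical to the target gives the lowest possible loss.
perfect = masked_feature_loss(target, target, mask)
# A sign-flipped prediction gives the highest possible loss.
flipped = [[-x for x in row] for row in target]
worst = masked_feature_loss(flipped, target, mask)

print(f"masked patches: {sum(mask)}/{num_patches}")
print(f"loss (perfect): {perfect:.4f}, loss (flipped): {worst:.4f}")
```

Only masked positions contribute to the loss, so the model is scored on features it could not observe directly and has to infer them from the visible patches.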
Citation impact
- FWCI: 50.54
- Percentile: 100%
- References: 188
Authors (9)
- Yuxin Fang (corresponding author)
Huazhong University of Science and Technology, Beijing Academy of Artificial Intelligence
- Wen Wang
Zhejiang University, Beijing Academy of Artificial Intelligence
- Binhui Xie
Beijing Academy of Artificial Intelligence, Beijing Institute of Technology
- Quan Sun
Beijing Academy of Artificial Intelligence
- Ledell Wu
Beijing Academy of Artificial Intelligence
Topics & keywords
- Computer science
- Artificial intelligence
- Segmentation
- Initialization
- Object detection
- Task (project management)
- Image segmentation
- Cognitive neuroscience of visual object recognition
- Quality Education