VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Beijing Academy of Artificial Intelligence · Shanghai Artificial Intelligence Laboratory · +3 more institutions
Abstract
Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to high masking ratio in encoder, masking decoder…
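The dual masking idea in the abstract can be sketched as a token-selection step. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dual_mask`, the uniform random sampling, and both ratio defaults are hypothetical, since the text above only says that the encoder and the decoder each operate on a subset of video tokens.

```python
import random


def dual_mask(num_tokens, encoder_mask_ratio=0.9, decoder_mask_ratio=0.5, seed=0):
    """Split token indices into an encoder-visible set and a decoder target set.

    Both ratios and the uniform random sampling are illustrative assumptions;
    they are not taken from the abstract.
    """
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)

    # Encoder mask: the encoder only processes the tokens left visible.
    num_visible = round(num_tokens * (1 - encoder_mask_ratio))
    encoder_visible = sorted(indices[:num_visible])

    # Tokens hidden from the encoder are the candidates for reconstruction.
    hidden = indices[num_visible:]

    # Decoder mask: reconstruct only a subset of the hidden tokens,
    # which shrinks the decoder's workload.
    num_targets = round(len(hidden) * (1 - decoder_mask_ratio))
    decoder_targets = sorted(hidden[:num_targets])

    return encoder_visible, decoder_targets


encoder_visible, decoder_targets = dual_mask(num_tokens=100)
print(len(encoder_visible), len(decoder_targets))
```

With these assumed ratios, the encoder sees a small fraction of the tokens and the decoder reconstructs only part of the remainder, so neither network processes the full token sequence.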
Citation impact
- FWCI: 46.23
- Percentile: 100%
- References: 120
Authors (8)
- Limin Wang (corresponding author)
Beijing Academy of Artificial Intelligence, Shanghai Artificial Intelligence Laboratory, Nanjing University
- Bingkun Huang
Beijing Academy of Artificial Intelligence, Shanghai Artificial Intelligence Laboratory, Nanjing University
- Zhiyu Zhao
Beijing Academy of Artificial Intelligence, Shanghai Artificial Intelligence Laboratory, Nanjing University
- Tong Zhan
Nanjing University
- Yinan He
SAIC-GM (China), Beijing Academy of Artificial Intelligence, Shanghai Artificial Intelligence Laboratory
Topics & keywords
- Computer science
- Encoder
- Autoencoder
- Scalability
- Masking (illustration)
- Artificial intelligence
- Variety (cybernetics)
- Speech recognition