VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Wang, Limin; Huang, Bingkun; Zhao, Zhiyu; Zhan, Tong; He, Yinan; Wang, Yi; Wang, Yali; Qiao, Yu

doi:10.1109/cvpr52729.2023.01398

Article · Jun 1, 2023 · Closed access

Beijing Academy of Artificial Intelligence · Shanghai Artificial Intelligence Laboratory · +3 more institutions

Indexed in Crossref

Abstract

Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to high masking ratio in encoder, masking decoder…
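
The dual masking idea in the abstract, where the encoder sees one subset of video tokens and the decoder processes another, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dual_mask`, the uniform random sampling, the two ratios, and the token count are all assumptions chosen for illustration, and the abstract above is truncated before it describes the actual masking schemes.

```python
import random


def dual_mask(num_tokens, encoder_mask_ratio=0.9, decoder_mask_ratio=0.5, seed=0):
    """Split token indices into an encoder-visible subset and a decoder target subset.

    Hypothetical helper: uniform random sampling and both ratios are
    illustrative assumptions, not the schemes used in the paper.
    """
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)

    # The encoder only operates on a small visible subset of tokens.
    num_encoder_visible = round(num_tokens * (1 - encoder_mask_ratio))
    encoder_visible = sorted(indices[:num_encoder_visible])
    encoder_masked = indices[num_encoder_visible:]

    # The decoder reconstructs only part of the tokens hidden from the encoder,
    # instead of all of them, which reduces decoder compute.
    num_decoder_targets = round(len(encoder_masked) * (1 - decoder_mask_ratio))
    decoder_targets = sorted(encoder_masked[:num_decoder_targets])

    return encoder_visible, decoder_targets


# Illustrative token count (assumed): 8 x 14 x 14 space-time tokens.
enc, dec = dual_mask(num_tokens=1568)
print(len(enc), len(dec))
```

In this sketch the two subsets are disjoint by construction, so the reconstruction targets never overlap with what the encoder has already seen.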

Citation impact

Total citations: 407

FWCI: 46.23
Percentile: 100%
References: 120

Authors: 8

Topics & keywords

Keywords

Computer science
Encoder
Autoencoder
Scalability
Masking (illustration)
Artificial intelligence
Variety (cybernetics)
Speech recognition
