Article · Jun 1, 2023 · Closed access

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Beijing Academy of Artificial Intelligence · Shanghai Artificial Intelligence Laboratory · +3 more institutions

Indexed in Crossref

Abstract

Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to the high masking ratio in the encoder, masking decoder…
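
To make the dual masking idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes uniform random sampling of token indices, a hypothetical function name `dual_mask`, and illustrative ratios of 0.9 for the encoder and 0.5 for the decoder; the abstract above specifies none of these. It also assumes that the decoder's subset is drawn from the tokens hidden from the encoder, which is one possible reading of "another subset".

```python
import random


def dual_mask(num_tokens, encoder_mask_ratio=0.9, decoder_mask_ratio=0.5, seed=0):
    """Split token indices into an encoder-visible subset and a decoder subset.

    Assumptions (not taken from the abstract): uniform random sampling,
    and a decoder ratio measured against the encoder-masked tokens.
    """
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)

    # Encoder mask: hide most tokens, keep only a small visible subset.
    num_encoder_masked = round(num_tokens * encoder_mask_ratio)
    num_visible = num_tokens - num_encoder_masked
    encoder_visible = sorted(indices[:num_visible])
    encoder_masked = indices[num_visible:]

    # Decoder mask: of the tokens hidden from the encoder, drop a further
    # fraction so the decoder only handles the remaining ones.
    num_decoder_masked = round(len(encoder_masked) * decoder_mask_ratio)
    num_targets = len(encoder_masked) - num_decoder_masked
    decoder_targets = sorted(encoder_masked[:num_targets])

    return encoder_visible, decoder_targets


if __name__ == "__main__":
    visible, targets = dual_mask(100)
    print(len(visible), "tokens visible to the encoder")
    print(len(targets), "tokens handled by the decoder")
```

With 100 tokens and the illustrative ratios, the encoder sees 10 tokens and the decoder handles 45 of the remaining 90, so both stages work on far fewer tokens than the full sequence.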


Funding