VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Tong, Zhan; Song, Yibing; Wang, Jue; Wang, Limin

doi:10.48550/arxiv.2203.12602

Preprint | arXiv (Cornell University) | Mar 23, 2022 | Green OA

Indexed in: arXiv, DataCite

Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable…
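The tube masking described in the abstract can be illustrated with a minimal sketch. It assumes that "tube" means one random spatial mask is sampled and then repeated across every temporal slice, so a masked patch position stays masked for the whole clip. The function name `tube_mask`, the 8 x 14 x 14 token grid, and the 0.9 ratio are illustrative assumptions, not values taken from this record or from the authors' code.

```python
import random


def tube_mask(t_tokens, h_tokens, w_tokens, mask_ratio=0.9, seed=None):
    """Return a nested list mask[t][h][w] of booleans (True = masked).

    Hypothetical helper: one spatial mask is drawn at random and shared
    by all temporal slices, which is the assumed meaning of tube masking.
    """
    rng = random.Random(seed)
    num_spatial = h_tokens * w_tokens
    num_masked = round(mask_ratio * num_spatial)

    # Choose which spatial positions are hidden (without replacement).
    masked = set(rng.sample(range(num_spatial), num_masked))
    spatial = [
        [(r * w_tokens + c) in masked for c in range(w_tokens)]
        for r in range(h_tokens)
    ]

    # Repeat the same spatial mask along the temporal axis.
    return [[row[:] for row in spatial] for _ in range(t_tokens)]


# Illustrative grid size (an assumption): 8 temporal x 14 x 14 spatial tokens.
mask = tube_mask(8, 14, 14, mask_ratio=0.9, seed=0)
masked_fraction = sum(
    cell for frame in mask for row in frame for cell in row
) / (8 * 14 * 14)
```

Under these assumptions, only the unmasked tokens (roughly 10% of the grid at this ratio) would be passed to an encoder, and the masked positions would serve as reconstruction targets.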

Citation impact

Total citations: 434

References: 0

Authors: 4

Topics & keywords

Keywords

Computer science
Masking (illustration)
Artificial intelligence
Training set
Code (set theory)
Labeled data
Transformer
Pattern recognition (psychology)