Preprint | arXiv (Cornell University) | Mar 23, 2022 | Green OA

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Indexed in: arXiv, DataCite

Abstract

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable…
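The tube masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `tube_mask`, its parameters, and the 8 x 14 x 14 patch grid are hypothetical choices made for illustration. It assumes that "tube" masking means one random spatial mask is drawn and then reused for every temporal slice, so a masked patch position stays hidden across the whole clip.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return a list of per-frame boolean masks (True = patch is masked).

    Hypothetical helper for illustration only; assumes the same spatial
    mask is shared by all temporal slices.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(mask_ratio * num_patches)
    # Draw the masked patch positions once, for a single frame.
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [i in masked_idx for i in range(num_patches)]
    # Reuse that spatial mask for every temporal slice, so each masked
    # position forms a "tube" through time.
    return [list(frame_mask) for _ in range(num_frames)]


# Illustrative sizes (assumed): 8 temporal slices over a 14 x 14 patch grid.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
masked_per_frame = sum(mask[0])
```

With these assumed sizes, 176 of the 196 patch positions in each frame are masked, and the masked positions are identical in every frame, which is what prevents a model from simply copying a hidden patch from a neighbouring frame.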

Citation impact

434 total citations
FWCI
Percentile
References
0
Citations per year

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Masking (illustration)
  • Artificial intelligence
  • Training set
  • Code (set theory)
  • Labeled data
  • Transformer
  • Pattern recognition (psychology)

Funding