VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Indexed in: arXiv, DataCite
Abstract
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable…
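To make the idea of tube masking concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "tube" means a single random spatial mask is drawn over the patch grid and reused for every temporal position, so that a masked patch stays masked across the whole clip. The function name `tube_mask`, its parameters, and the 14 x 14 grid in the usage lines are hypothetical choices made for illustration.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio=0.9, seed=None):
    """Return a num_frames x grid_h x grid_w nested list of booleans.

    True marks a masked patch. Assumption: one spatial mask is sampled
    and repeated along the time axis, which is what "tube" is taken to
    mean in this sketch.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    # Number of spatial patches to hide in each frame.
    num_masked = round(mask_ratio * num_patches)
    # Sample the masked patch positions once, without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    spatial = [
        [(r * grid_w + c) in masked for c in range(grid_w)]
        for r in range(grid_h)
    ]
    # Repeat the same spatial mask for every temporal position.
    return [[row[:] for row in spatial] for _ in range(num_frames)]


# Hypothetical usage: 8 temporal positions over a 14 x 14 patch grid.
mask = tube_mask(num_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9, seed=0)
visible_per_frame = sum(not m for row in mask[0] for m in row)
```

With a 90% ratio on this assumed grid, only 20 of 196 patches per frame remain visible, which illustrates why reconstruction under such masking is a demanding task.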
Citation impact
- Total citations: 434
- FWCI: n/a
- Percentile: n/a
- References: 0
Authors: 4
Topics & keywords
Topics
Keywords
- Computer science
- Masking (illustration)
- Artificial intelligence
- Training set
- Code (set theory)
- Labeled data
- Transformer
- Pattern recognition (psychology)