Preprint | arXiv (Cornell University) | Feb 7, 2022 | Green OA

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Indexed in: arXiv, DataCite

Abstract

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent…
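
The setup the abstract describes can be illustrated with a toy sketch: a teacher encodes the full input to produce latent targets, a student encodes a masked view, and the student is trained to match the teacher at the masked positions. Everything concrete below is an assumption made for illustration and is not taken from the abstract: the two-weight "encoder" stands in for a Transformer, the squared-error loss and the exponential-moving-average (EMA) teacher update are one common way to realize self-distillation, and all function names are hypothetical.

```python
# Toy sketch of masked latent prediction with self-distillation.
# Assumptions (not from the abstract): the tiny linear "encoder",
# the squared-error loss, and the EMA teacher update.

def encode(weights, tokens):
    """Stand-in encoder: each output mixes a token with the sequence mean,
    so every latent depends on the whole input (a crude form of context)."""
    mean = sum(tokens) / len(tokens)
    return [weights[0] * t + weights[1] * mean for t in tokens]

def mask_tokens(tokens, masked_idx, mask_value=0.0):
    """Replace the tokens at the masked positions with a placeholder value."""
    return [mask_value if i in masked_idx else t for i, t in enumerate(tokens)]

def masked_regression_loss(student_out, teacher_out, masked_idx):
    """Mean squared error between student and teacher latents,
    computed only at the masked positions."""
    total = sum((student_out[i] - teacher_out[i]) ** 2 for i in masked_idx)
    return total / len(masked_idx)

def ema_update(teacher_w, student_w, tau=0.99):
    """Move the teacher weights a small step towards the student weights."""
    return [tau * tw + (1 - tau) * sw for tw, sw in zip(teacher_w, student_w)]

tokens = [0.5, -1.0, 2.0, 0.25]   # toy input sequence
masked_idx = {1, 2}               # positions hidden from the student
student_w = [0.8, 0.1]
teacher_w = [1.0, 0.2]

targets = encode(teacher_w, tokens)                         # teacher sees the full input
preds = encode(student_w, mask_tokens(tokens, masked_idx))  # student sees the masked view
loss = masked_regression_loss(preds, targets, masked_idx)

# A real training loop would update student_w by gradient descent on `loss`
# before refreshing the teacher; that step is omitted here.
teacher_w = ema_update(teacher_w, student_w)
```

The point of the sketch is the data flow rather than the model: the targets are latent vectors computed from the unmasked input, not words, visual tokens, or speech units, so the same loop applies to any modality once the input is turned into a token sequence.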

Citation impact

Total citations: 240

Authors

6 authors

Topics & keywords

Keywords
  • Computer science
  • Transformer
  • Artificial intelligence
  • Modalities
  • Natural language processing
  • Modality (human–computer interaction)
  • Natural language
  • Speech recognition
UN Sustainable Development Goals
  • Quality Education