data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Baevski, Alexei; Hsu, Wei-Ning; Xu, Qiantong; Babu, Arun; Gu, Jiatao; Auli, Michael

doi:10.48550/arxiv.2202.03555

Preprint, arXiv (Cornell University), Feb 7, 2022. Green OA.

Indexed in: arXiv, DataCite

Abstract

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent…
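The core idea in the abstract (a student that sees a masked input regresses onto latent representations computed from the full input) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the record above: the `encode` function is a tiny stand-in for a Transformer, the teacher is kept as an exponential moving average of the student (a common self-distillation choice), the loss is a plain squared error, and the names `ema_update`, `masked_prediction_loss`, `tau` and `mask_value` are hypothetical.

```python
import numpy as np


def encode(weights, x):
    """Toy stand-in for a Transformer encoder (assumption, not the real model).

    Each time step is projected linearly, then the mean over all time steps
    is added so that every output depends on the whole sequence.
    """
    h = x @ weights
    return np.tanh(h + h.mean(axis=0, keepdims=True))


def ema_update(teacher, student, tau=0.999):
    """Move the teacher weights a small step toward the student weights."""
    return tau * teacher + (1.0 - tau) * student


def masked_prediction_loss(student_w, teacher_w, x, mask, mask_value=0.0):
    """Squared error between student predictions and teacher targets.

    The teacher encodes the full input; the student encodes a masked view.
    Only the masked time steps contribute to the loss.
    """
    targets = encode(teacher_w, x)                     # full input
    x_masked = np.where(mask[:, None], mask_value, x)  # masked view
    preds = encode(student_w, x_masked)
    diff = preds[mask] - targets[mask]
    return float(np.mean(diff ** 2))


# Hypothetical toy data: 8 time steps with 4 features each.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
student = rng.normal(size=(4, 4))
teacher = student.copy()

mask = np.zeros(8, dtype=bool)
mask[[2, 3, 5]] = True  # hide three time steps from the student

loss = masked_prediction_loss(student, teacher, x, mask)
teacher = ema_update(teacher, student)
```

In a real training loop the student weights would be updated by gradient descent on this loss before each teacher update; that step is omitted here to keep the sketch short.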

Citation impact

Total citations: 240

FWCI: —
Percentile: —
References: 0

Authors: 6

Topics & keywords

Keywords

Computer science
Transformer
Artificial intelligence
Modalities
Natural language processing
Modality (human–computer interaction)
Natural language
Speech recognition

UN Sustainable Development Goals

Quality Education
