data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Indexed in: arXiv, DataCite
Abstract
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent…
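The core idea in the abstract (a student predicting latent representations of the full input from a masked view, in a self-distillation setup) can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration and not the authors' implementation: the `encode` function is a toy stand-in for a Transformer, the teacher is modelled as an exponential moving average of the student, masked positions are simply zeroed, and mean squared error is used as the regression loss. All function names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(weights, x):
    """Toy stand-in for a Transformer encoder (assumption, not the real model).

    Each position is projected linearly, then the sequence mean is added so
    that every output depends on the whole input, i.e. is contextualized.
    """
    h = x @ weights
    return np.tanh(h + h.mean(axis=0, keepdims=True))

def ema_update(teacher_w, student_w, tau=0.999):
    """Move the teacher weights toward the student by exponential moving average."""
    return tau * teacher_w + (1.0 - tau) * student_w

def masked_prediction_loss(student_w, teacher_w, x, mask):
    """Regress teacher latents of the full input from the masked input."""
    targets = encode(teacher_w, x)              # teacher sees the full input
    x_masked = np.where(mask[:, None], 0.0, x)  # masked positions are zeroed
    preds = encode(student_w, x_masked)         # student sees the masked view
    diff = preds[mask] - targets[mask]          # compare only at masked positions
    return float(np.mean(diff ** 2))

# Hypothetical toy input: 8 positions with 4 features each, 3 of them masked.
seq_len, dim = 8, 4
x = rng.normal(size=(seq_len, dim))
mask = np.zeros(seq_len, dtype=bool)
mask[[1, 4, 5]] = True

student_w = rng.normal(size=(dim, dim))
teacher_w = student_w.copy()                    # teacher starts as a copy

loss = masked_prediction_loss(student_w, teacher_w, x, mask)

# Pretend a gradient step shifted the student by 0.1, then let the teacher follow.
teacher_w = ema_update(teacher_w, student_w + 0.1)
```

In this sketch the targets are continuous latent vectors rather than discrete tokens, which is the contrast the abstract draws with modality-specific targets such as words or visual tokens.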
Citation impact
- Total citations: 240
- FWCI: not available
- Percentile: not available
- References: 0
Authors: 6
Topics & keywords
Topics
Keywords
- Computer science
- Transformer
- Artificial intelligence
- Modalities
- Natural language processing
- Modality (human–computer interaction)
- Natural language
- Speech recognition
UN Sustainable Development Goals
- Quality Education