SSAST: Self-Supervised Audio Spectrogram Transformer
Indexed in Crossref
Abstract
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining…
Citation impact
- Total citations: 258
- FWCI: 28.93
- Percentile: 100%
- References: 54
Authors: 4
Topics & keywords
Keywords
- Spectrogram
- Computer science
- Discriminative model
- Speech recognition
- Transformer
- Artificial intelligence
- Convolutional neural network
- Hidden Markov model
UN Sustainable Development Goals
- Reduced inequalities