SSAST: Self-Supervised Audio Spectrogram Transformer

Indexed in Crossref

Abstract

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining…

Citation impact

Total citations: 258
FWCI: 28.93
Percentile: 100%
References: 54

Authors

4

Topics & keywords

Keywords
  • Spectrogram
  • Computer science
  • Discriminative model
  • Speech recognition
  • Transformer
  • Artificial intelligence
  • Convolutional neural network
  • Hidden Markov model
UN Sustainable Development Goals
  • Reduced inequalities