SSAST: Self-Supervised Audio Spectrogram Transformer

Gong, Yuan; Lai, Cheng-I; Chung, Yu-An; Glass, James

doi:10.1609/aaai.v36i10.21315

Article · Proceedings of the AAAI Conference on Artificial Intelligence · Jun 28, 2022 · Diamond OA

Indexed in Crossref

Abstract

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining…

Citation impact

Total citations: 258

FWCI: 28.93
Percentile: 100%
References: 54

Authors: 4

Topics & keywords

Keywords

Spectrogram
Computer science
Discriminative model
Speech recognition
Transformer
Artificial intelligence
Convolutional neural network
Hidden Markov model

UN Sustainable Development Goals

Reduced inequalities
