HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Chen, Ke; Du, Xingjian; Zhu, Bilei; Ma, Zejun; Berg-Kirkpatrick, Taylor; Dubnov, Shlomo

doi:10.1109/icassp43922.2022.9746312

Article · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · Apr 27, 2022 · Green open access



University of California San Diego · Tencent (China)


Abstract

Audio classification is the important task of mapping audio samples to their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memory and long training times, and they rely on pretrained vision models to achieve high performance, which limits the model’s scalability in audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is further combined with a token-semantic module that maps the final outputs into class feature maps, thus enabling the model to perform audio event detection (i.e.…
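
The abstract says the token-semantic module maps the final outputs into class feature maps, which is what lets one model serve both clip-level classification and time-resolved event detection. Below is a minimal sketch of that idea under stated assumptions: the final-stage tokens are treated as a time x frequency grid, a plain per-token linear projection stands in for the paper's own token-semantic layer, and the pooling order (average logits over frequency, then over time) is an assumption rather than the paper's specification. All names (`token_semantic_head`, `TokenGrid`) are hypothetical.

```python
import math
from typing import List, Tuple

# Hypothetical layout: tokens[t][f] is the D-dimensional feature vector of the
# final-stage token at time index t and frequency index f.
TokenGrid = List[List[List[float]]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def token_semantic_head(
    tokens: TokenGrid,
    weights: List[List[float]],  # [C][D] per-class projection (assumed stand-in)
    bias: List[float],           # [C]
) -> Tuple[List[List[float]], List[float]]:
    """Map a (time x frequency) token grid to class feature maps.

    Returns (framewise, clipwise): framewise[t][c] is the probability of
    class c at time step t (detection); clipwise[c] is the clip-level
    probability of class c (classification).
    """
    num_classes = len(weights)
    frame_logits: List[List[float]] = []
    for freq_tokens in tokens:
        # Project every frequency token to C class logits, then average over
        # the frequency axis to get one logit per class for this time step.
        acc = [0.0] * num_classes
        for x in freq_tokens:
            for c in range(num_classes):
                acc[c] += sum(w * xi for w, xi in zip(weights[c], x)) + bias[c]
        frame_logits.append([a / len(freq_tokens) for a in acc])

    # Frame-level (detection) output: sigmoid of each frame's class logits.
    framewise = [[sigmoid(z) for z in row] for row in frame_logits]
    # Clip-level (classification) output: average logits over time, then sigmoid.
    num_frames = len(frame_logits)
    clipwise = [
        sigmoid(sum(row[c] for row in frame_logits) / num_frames)
        for c in range(num_classes)
    ]
    return framewise, clipwise


# Toy example: 2 time steps x 2 frequency tokens, 1-dim features, 1 class.
grid = [[[2.0], [0.0]], [[-2.0], [-2.0]]]
framewise, clipwise = token_semantic_head(grid, weights=[[1.0]], bias=[0.0])
print(framewise, clipwise)  # framewise ≈ [[0.731], [0.119]], clipwise ≈ [0.378]
```

In the toy run the class is "active" in the first time step and inactive in the second, which is the kind of time-localized signal a pure clip-level head cannot provide; the clip-level score is derived from the same per-frame maps.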

Citation impact: 247 total citations · FWCI: 27.24 · Percentile: 100% · References: 39




Keywords

Computer science
Speech recognition
Security token
Transformer
Sound (geography)
Artificial intelligence
Natural language processing
Acoustics
