AudioCLIP: Extending CLIP to Image, Text and Audio

Guzhov, Andrey; Raue, Federico; Hees, J.J. van; Dengel, Andreas

doi:10.1109/icassp43922.2022.9747631

Article | ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 27, 2022 | Closed access

Indexed in Crossref

Abstract

The rapidly evolving field of sound classification has greatly benefited from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. We present AudioCLIP – an extension of the CLIP model that handles audio in addition to text and images. Utilizing the AudioSet dataset, our proposed model incorporates the ESResNeXt audio-model into the CLIP framework, thus enabling it to perform multimodal classification and keeping CLIP’s zero-shot capabilities. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task and outperforms others by reaching accuracies of 97.15% on…

Citation impact

Total citations: 279

FWCI: 31.37
Percentile: 100%
References: 51

Authors: 4

Topics & keywords

Keywords

Computer science
Task (project management)
Field (mathematics)
Shot (pellet)
Code (set theory)
Fuse (electrical)
Image (mathematics)
Artificial intelligence

UN Sustainable Development Goals

Sustainable cities and communities
