AudioCLIP: Extending CLIP to Image, Text and Audio

Indexed in Crossref

Abstract

The rapidly evolving field of sound classification has benefited greatly from methods developed in other domains. The current trend is to fuse domain-specific tasks and approaches, which has given the community strong new models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Using the AudioSet dataset, the proposed model incorporates the ESResNeXt audio model into the CLIP framework, enabling multimodal classification while keeping CLIP’s zero-shot capabilities. AudioCLIP achieves new state-of-the-art results on the Environmental Sound Classification (ESC) task and outperforms other approaches, reaching accuracies of 97.15 % on…
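The abstract does not spell out how zero-shot classification is scored, so the following is only a minimal sketch under a stated assumption: as in CLIP-style models generally, each candidate label is represented by a text embedding, the query (here, an audio clip) by an embedding in the same space, and the label with the highest cosine similarity is chosen. The function names and the three-dimensional toy vectors are hypothetical and stand in for real encoder outputs; this is not the authors' code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(query_embedding, label_embeddings):
    """Return the best-matching label and all similarity scores.

    query_embedding: embedding of the input (e.g. an audio clip).
    label_embeddings: dict mapping label text to its text embedding.
    Both are assumed to live in the same shared embedding space.
    """
    scores = {
        label: cosine_similarity(query_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors (hypothetical); a real system would obtain these from encoders.
label_embeddings = {
    "dog bark": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "siren": [0.0, 0.0, 1.0],
}
audio_embedding = [0.1, 0.9, 0.2]

predicted_label, scores = zero_shot_classify(audio_embedding, label_embeddings)
print(predicted_label)
```

Because no label-specific classifier is trained, new classes can be added simply by embedding their text descriptions, which is what makes the setup "zero-shot".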

Citation impact

Total citations: 279
FWCI: 31.37
Percentile: 100%
References: 51

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Task (project management)
  • Field (mathematics)
  • Shot (pellet)
  • Code (set theory)
  • Fuse (electrical)
  • Image (mathematics)
  • Artificial intelligence
UN Sustainable Development Goals
  • Sustainable cities and communities