CLAP Learning Audio Concepts from Natural Language Supervision

Elizalde, Benjamin; Deshmukh, Soham; Ismail, Mahmoud Al; Wang, Huaming

doi:10.1109/icassp49357.2023.10095889

Article · May 5, 2023 · Closed access

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang

Microsoft Research (United Kingdom)

Indexed in Crossref

Abstract

Mainstream machine listening models are trained to learn audio concepts under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which connects language and audio by using two encoders and a contrastive learning objective, bringing audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio and text pairs and evaluated it on 16 downstream tasks…
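The abstract describes two encoders trained with a contrastive objective so that matching audio and text land close together in a joint space. The following is a minimal sketch of one common form of such an objective, a symmetric cross-entropy over cosine similarities. It is an illustration under stated assumptions, not the authors' implementation: the random vectors stand in for encoder outputs, and the function names, the embedding size, and the `temperature` value of 0.07 are all hypothetical choices that do not come from this record.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: pair i of audio_emb matches pair i of text_emb, and every
    other combination in the batch is treated as a negative.
    """
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Stand-in "encoder outputs": 4 pairs of 8-dimensional embeddings.
random.seed(0)
audio = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_matched = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in audio]
text_shuffled = text_matched[::-1]  # every audio clip now has the wrong caption

loss_matched = contrastive_loss(audio, text_matched)
loss_shuffled = contrastive_loss(audio, text_shuffled)
```

In this toy setup the loss is lower when each audio embedding is paired with its own text embedding than when the captions are shuffled, which is the behavior a contrastive objective of this kind rewards during training.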

Citation impact

371 total citations

FWCI: 69.28
Percentile: 100%
References: 28

Authors: 4

Topics & keywords

Keywords

Computer science
Class (philosophy)
Artificial intelligence
Speech recognition
Natural language processing
Encoder
Audio signal
Speech coding

UN Sustainable Development Goals

Quality Education
