Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Wu, Yusong; Chen, Ke; Zhang, Tianyu; Hui, Yuchen; Berg-Kirkpatrick, Taylor; Dubnov, Shlomo

doi:10.1109/icassp49357.2023.10095969

Article · May 5, 2023 · Green OA

Mila - Quebec Artificial Intelligence Institute · University of California, San Diego

Indexed in Crossref

Abstract

Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and…
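The contrastive objective named in the abstract can be illustrated with a minimal sketch. This is a generic symmetric (audio-to-text and text-to-audio) contrastive loss of the kind commonly used for paired-modality pretraining, not necessarily the exact formulation in the paper. The function names and the temperature value of 0.07 are assumptions made for illustration, and the embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    Pair i of audio_embs and text_embs is treated as the positive;
    every other pairing in the batch acts as a negative.
    The temperature is an assumed, illustrative value.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    n = len(audio)

    # Cosine similarity of every audio clip with every caption, scaled.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]

    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-audio: same idea along the columns.
    columns = [list(col) for col in zip(*logits)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (a2t + t2a)


# Toy batch: two "audio" embeddings and two "caption" embeddings.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = symmetric_contrastive_loss(audio_batch, aligned_text)
swapped_loss = symmetric_contrastive_loss(audio_batch, swapped_text)
```

In this toy batch the correctly paired captions give a much lower loss than the swapped ones, which is the signal that pulls matching audio and text embeddings together during training.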

Citation impact

Total citations: 376
FWCI: 71.18
Percentile: 100%
References: 39
Authors: 6

Topics & keywords

Keywords

Computer science
Audio mining
Natural language processing
Pipeline (software)
Speech recognition
Artificial intelligence
Encoder
Construct (python library)

UN Sustainable Development Goals

Quality Education
