WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Mei, Xinhao; Meng, Chutong; Liu, Haohe; Kong, Qiuqiang; Ko, Tom; Zhao, Chengqi; Plumbley, Mark D.; Zou, Yuexian; Wang, Wenwu

doi:10.1109/taslp.2024.3419446

Article · IEEE/ACM Transactions on Audio Speech and Language Processing · Jan 1, 2024 · Closed access

University of Surrey · Johns Hopkins University · +2 more institutions

Indexed in Crossref

Abstract

The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years, yet the limited size of existing audio-language datasets poses challenges for researchers due to the costly and time-consuming collection process. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400 k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a…

Citation impact

Total citations: 123

FWCI: 38.88
Percentile: 100%
References: 106

Authors: 9

Topics & keywords

Keywords

Closed captioning
Computer science
Speech recognition
Audio analyzer
Natural language processing
Linguistics
Artificial intelligence
Audio signal

Funding

Engineering and Physical Sciences Research Council
Award: EP/T019751/1