WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
University of Surrey · Johns Hopkins University · +2 more institutions
Abstract
Audio-language (AL) multimodal learning has advanced significantly in recent years, yet the limited size of existing audio-language datasets poses challenges for researchers because collecting them is costly and time-consuming. To address this data scarcity, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the raw descriptions harvested online are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a…
Citation impact
- FWCI: 38.88
- Percentile: 100%
- References: 106
Authors: 9
Topics & keywords
- Closed captioning
- Computer science
- Speech recognition
- Audio analyzer
- Natural language processing
- Linguistics
- Artificial intelligence
- Audio signal