Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Mila - Quebec Artificial Intelligence Institute · University of California, San Diego
Abstract
Contrastive learning has shown remarkable success in multimodal representation learning. In this paper, we propose a contrastive language-audio pretraining pipeline that develops audio representations by combining audio data with natural language descriptions. To this end, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and…
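The contrastive objective that the abstract refers to can be illustrated with a minimal sketch. The abstract does not spell out the loss, so this assumes the symmetric cross-entropy formulation commonly used in contrastive language-audio and language-image pretraining. The function names, the temperature value of 0.07, and the toy embeddings are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    audio_emb[i] and text_emb[i] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)

    # Cosine similarity between every audio and every text, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-audio: same idea over the columns of the similarity matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (a2t + t2a)


# Toy batch: two pairs whose embeddings already agree.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_contrastive_loss(aligned, aligned)

# Same audio embeddings, but the captions are swapped.
loss_swapped = symmetric_contrastive_loss(aligned, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the matched pairs give a loss close to zero, while swapping the captions gives a much larger loss, which is the behavior the pretraining objective is meant to reward.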
Citation impact
- FWCI: 71.18
- Percentile: 100%
- References: 39
Authors: 6
Topics & keywords
- Computer science
- Audio mining
- Natural language processing
- Pipeline (software)
- Speech recognition
- Artificial intelligence
- Encoder
- Construct (python library)
- Quality Education