CNN architectures for large-scale audio classification
Google (United States) · Mountain View College
Abstract
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5]…
Citation impact
- FWCI: 129.18
- Percentile: 100%
- References: 33
Authors (13)
- Shawn Hershey (corresponding author)
  Google (United States), Mountain View College
- Sourish Chaudhuri
  Google (United States), Mountain View College
- Daniel P. W. Ellis
  Google (United States), Mountain View College
- Jort F. Gemmeke
  Google (United States), Mountain View College
- Aren Jansen
  Mountain View College, Google (United States)
Topics & keywords
- Computer science
- Convolutional neural network
- Task (project management)
- Contextual image classification
- Multi-label classification
- Artificial intelligence
- Set (abstract data type)
- Vocabulary
- Quality Education