CNN architectures for large-scale audio classification

Hershey, Shawn; Chaudhuri, Sourish; Ellis, Daniel P. W.; Gemmeke, Jort F.; Jansen, Aren; Moore, Robert C.; Plakal, Manoj; Platt, Devin; Saurous, Rif A.; Seybold, Bryan; Slaney, Malcolm; Weiss, Ron J.; Wilson, Kevin

doi:10.1109/icassp.2017.7952132

Article · Mar 1, 2017 · Closed access

Google (United States) · Mountain View College

Indexed in Crossref

Abstract

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5]…

Citation impact

Total citations: 2,437

FWCI: 129.18
Percentile: 100%
References: 33

Authors

13

Topics & keywords

Keywords

Computer science
Convolutional neural network
Task (project management)
Contextual image classification
Multi-label classification
Artificial intelligence
Set (abstract data type)
Vocabulary

UN Sustainable Development Goals

Quality Education
