Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Amodei, Dario; Anubhai, Rishita; Battenberg, Eric; Case, Carl; Casper, Jared; Catanzaro, Bryan; Chen, Jingdong; Chrzanowski, Mike; Coates, Adam; Diamos, Greg; Elsen, Erich; Engel, Jesse; Fan, Linxi; Fougner, Christopher; Han, Tony Xiao; Hannun, Awni; Jun, Billy; LeGresley, Patrick; Lin, Libby; Narang, Sharan; Ng, Andrew; Ozair, Sherjil; Prenger, Ryan; Raiman, Jonathan; Satheesh, Sanjeev; Seetapun, David; Sengupta, Shubho; Wang, Yi; Wang, Zhiqian; Wang, Chong; Xiao, Bo; Yogatama, Dani; Zhan, Jun; Zhu, Zhenyao

doi:10.48550/arxiv.1512.02595

Preprint, arXiv (Cornell University), Dec 8, 2015, Green OA

Indexed in: arXiv, DataCite

Abstract

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with…

Citation impact

Total citations: 2,179

FWCI: —
Percentile: —
References: 51

Authors: 34

Topics & keywords

Keywords

Computer science
End-to-end principle
Mandarin Chinese
Speedup
Speech recognition
Latency (audio)
Low latency (capital markets)
Variety (cybernetics)
