NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality
Peking University · Microsoft Research Asia (China) · +1 more institution
Abstract
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text…
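To make the idea of judging quality by statistical significance concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes paired listener scores for synthesized and recorded speech and applies a two-sided Wilcoxon signed-rank test (the test listed among this record's keywords) using a normal approximation. The function names, the example scores, and the 0.05 threshold are illustrative assumptions rather than details taken from the abstract.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples.

    Uses the normal approximation with a tie correction and no
    continuity correction. Returns (W_plus, p_value).
    """
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    # Test statistic: sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, p_value


def no_significant_difference(synth_scores, human_scores, alpha=0.05):
    """Hypothetical helper: True when the test fails to reject equality.

    alpha=0.05 is an assumed threshold, not a value from the abstract.
    """
    _, p_value = wilcoxon_signed_rank(synth_scores, human_scores)
    return p_value > alpha


if __name__ == "__main__":
    # Made-up paired ratings, for illustration only.
    synth = [4.0, 4.5, 3.5, 4.0, 4.5, 4.0, 3.5, 4.5]
    human = [4.5, 4.0, 4.0, 3.5, 4.5, 4.5, 3.0, 4.0]
    print(wilcoxon_signed_rank(synth, human))
    print(no_significant_difference(synth, human))
```

The normal approximation is only reasonable for moderately large samples; for small listening tests an exact test, such as the one offered by a statistics library, would be the safer choice.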
Citation impact
- FWCI: 44.57
- Percentile: 100%
- References: 57
Authors (14)
- Xu Tan (corresponding author)
  Peking University, Microsoft Research Asia (China)
- Jiawei Chen
  Microsoft Research Asia (China)
- Haohe Liu
  University of Surrey, Microsoft Research Asia (China)
- Jian Cong
  Microsoft Research Asia (China)
- Chen Zhang
  Microsoft Research Asia (China)
Topics & keywords
- Mean opinion score
- Computer science
- Wilcoxon signed-rank test
- End-to-end principle
- Quality Score
- Speech recognition
- Leverage (statistics)
- Encoder