NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality

Peking University · Microsoft Research Asia (China) · +1 more institution

Indexed in: Crossref, PubMed

Abstract

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text…
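
The abstract bases human-level quality on the statistical significance of a subjective measure, and the keywords below name the Wilcoxon signed-rank test. As a rough illustration of that kind of check, the sketch below compares paired listener ratings for synthesized and recorded speech. It is not the authors' evaluation code: the function name `wilcoxon_signed_rank`, the use of the normal approximation, the 0.05 threshold, and the ratings themselves are all assumptions made for this example.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Hypothetical helper for illustration. Returns (w_plus, w_minus, p_value).
    Pairs with a zero difference are discarded before ranking.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Null distribution of W+: mean and tie-corrected variance.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, w_minus, p_value


# Made-up paired ratings for illustration only (not data from the paper).
synthesized = [4, 5, 4, 3, 5, 4, 4, 5]
recorded = [4, 4, 5, 3, 5, 5, 4, 4]

w_plus, w_minus, p = wilcoxon_signed_rank(synthesized, recorded)
ALPHA = 0.05  # assumed significance level
print(f"W+={w_plus}, W-={w_minus}, p={p:.3f}")
print("no significant difference" if p > ALPHA else "significant difference")
```

The normal approximation is only reasonable for larger samples; with a handful of pairs, as in this toy data, an exact test would be the safer choice.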

Citation impact

  • Total citations: 145
  • FWCI: 44.57
  • Percentile: 100%
  • References: 57

Authors

14

Topics & keywords

Keywords
  • Mean opinion score
  • Computer science
  • Wilcoxon signed-rank test
  • End-to-end principle
  • Quality Score
  • Speech recognition
  • Leverage (statistics)
  • Encoder

Funding