NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality

Tan, Xu; Chen, Jiawei; Liu, Haohe; Cong, Jian; Zhang, Chen; Liu, Yanqing; Wang, Xi; Leng, Yichong; Yi, Yuanhao; He, Lei; Zhao, Sheng; Qin, Tao; Soong, Frank K.; Liu, Tie‐Yan

doi:10.1109/tpami.2024.3356232

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Jan 19, 2024 · Closed access

Peking University · Microsoft Research Asia (China) · +1 more institution

Indexed in: Crossref, PubMed

Abstract

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of subjective measures and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text…
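
The abstract defines human-level quality through the statistical significance of a subjective measure, and the record's keywords list the Wilcoxon signed-rank test. As a rough illustration only, the sketch below runs a two-sided Wilcoxon signed-rank test on paired listener scores using a normal approximation without continuity correction. The function name `wilcoxon_signed_rank`, the sample scores, and the 0.05 threshold are all assumptions made for this example and are not taken from the paper.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Hypothetical helper for illustration; returns (W_plus, p_value).
    """
    # Paired differences; zero differences are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences, nothing to test

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t  # tie correction for the variance
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value


# Made-up paired scores (synthesized vs. recorded), for illustration only.
synth = [4, 5, 4, 3, 5, 4, 4, 5]
human = [5, 4, 4, 4, 5, 3, 5, 4]
w, p = wilcoxon_signed_rank(synth, human)
ALPHA = 0.05  # assumed significance threshold
print("significant difference" if p < ALPHA else "no significant difference")
```

Under this reading, a system would be judged against human recordings by whether the paired scores differ significantly. The normal approximation is coarse for small samples, so a library implementation with exact p-values would be preferable in practice.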

Citation impact

Total citations: 145

FWCI: 44.57
Percentile: 100%
References: 57

Authors (14)

Xu Tan (corresponding author): Peking University, Microsoft Research Asia (China)
Jiawei Chen: Microsoft Research Asia (China)
Haohe Liu: University of Surrey, Microsoft Research Asia (China)
Jian Cong: Microsoft Research Asia (China)
Chen Zhang: Microsoft Research Asia (China)

Topics & keywords

Keywords

Mean opinion score
Computer science
Wilcoxon signed-rank test
End-to-end principle
Quality Score
Speech recognition
Leverage (statistics)
Encoder
