FastSpeech: Fast, Robust and Controllable Text to Speech

Ren, Yi; Ruan, Yangjun; Tan, Xu; Qin, Tao; Zhao, Sheng; Zhao, Zhou; Liu, Tie‐Yan

doi:10.48550/arxiv.1905.09263

Preprint · arXiv (Cornell University) · May 22, 2019 · Green OA

Zhejiang University · Microsoft Research (United Kingdom)

Indexed in: arXiv, DataCite

Abstract

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS.…

Citation impact

580 total citations

FWCI: —
Percentile: —
References: 26

Authors: 7

Topics & keywords

Keywords

Spectrogram
Speech recognition
Computer science
Speech synthesis
Autoregressive model
Encoder
Parametric statistics
Artificial neural network

UN Sustainable Development Goals

Quality Education
