FastSpeech: Fast, Robust and Controllable Text to Speech
Indexed in DataCite
Abstract
Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS.…
Citation impact
- Total citations: 259
- FWCI: —
- Percentile: —
- References: 0
Authors
1
Topics & keywords
Keywords
- Spectrogram
- Speech recognition
- Computer science
- Speech synthesis
- Autoregressive model
- Encoder
- Artificial neural network
- Parametric statistics
UN Sustainable Development Goals
- Quality Education