Neural Speech Synthesis with Transformer Network

University of Electronic Science and Technology of China · Microsoft Research Asia (China) · +1 more institution

Indexed in Crossref

Abstract

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs…
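To illustrate the parallelism the abstract refers to, here is a minimal sketch of multi-head self-attention in plain Python. It is not the authors' implementation: the function names, the random weights, and the toy dimensions (4 positions, model width 8, 2 heads of width 4) are all assumptions chosen for illustration, and it omits masking, positional encoding, and everything else specific to the paper's model. The point is only that each head computes the hidden states for every position from a handful of matrix products, with no step-by-step recurrence over the sequence.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one head.

    x is (seq_len x d_model); each weight matrix is (d_model x d_head).
    Returns the head output and the (seq_len x seq_len) attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # every position scored against every other
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, heads, w_o):
    """Run each head on the same input, concatenate, then project with w_o."""
    outputs = [self_attention_head(x, *head)[0] for head in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy dimensions, assumed for illustration only.
rng = random.Random(0)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
w_o = random_matrix(n_heads * d_head, d_model, rng)

y = multi_head_self_attention(x, heads, w_o)
print(len(y), len(y[0]))  # prints: 4 8
```

The output has one row per input position, and none of those rows depends on a previously computed row of the output, which is what allows them to be computed together. An RNN, by contrast, must finish the hidden state at one step before it can start the next.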

Citation impact

Total citations: 732
FWCI: 56.52
Percentile: 100%
References: 36

Authors

5

Topics & keywords

Keywords
  • Computer science
  • Transformer
  • Inference
  • Spectrogram
  • Artificial neural network
  • Encoder
  • Machine translation
  • Recurrent neural network
UN Sustainable Development Goals
  • Quality Education

Funding