FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Ren, Yi

doi:10.48550/arxiv.2006.04558

Preprint, arXiv (Cornell University), Jun 8, 2020, Green OA

Yi Ren

Zhejiang University

Indexed in: arXiv, DataCite

Abstract

Non-autoregressive text-to-speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of the FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in the output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target…

Citation impact

Total citations: 513

FWCI: —
Percentile: —
References: 40

Authors

Yi Ren (corresponding author)
Zhejiang University

Topics & keywords

Keywords

Computer science
Autoregressive model
Speech recognition
Inference
Waveform
Spectrogram
Speech synthesis
Duration (music)

UN Sustainable Development Goals

Quality Education
