Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Harbin Institute of Technology · Nankai University · +2 more institutions

Indexed in Crossref

Abstract

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capability and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. Experiment results show that VALL-E…

Citation impact

  • Total citations: 87
  • FWCI: 165.36
  • Percentile: 100%
  • References: 107

Authors

13

Topics & keywords

Keywords
  • Computer science
  • Speech recognition
  • Artificial intelligence
  • Codec
  • Zero (linguistics)
  • Speech processing
  • Natural language processing
  • Codec2
UN Sustainable Development Goals
  • Quality Education