Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Harbin Institute of Technology · Nankai University · +2 more institutions
Abstract
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capability and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. Experiment results show that VALL-E…
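As a rough illustration of what "conditional language modeling" over codec codes can mean, the generic autoregressive factorization below may help. The notation is assumed for illustration and does not come from the abstract: x is the input text sequence, \tilde{c} is the discrete code sequence of the 3-second enrolled recording, and c_1, ..., c_T are the discrete codec codes of the speech to be synthesized. The model described in the paper may factor the codes differently, for example across the codec's quantizer levels.

```latex
% Generic conditional language model over discrete codec codes
% (illustrative notation; not taken from the abstract).
p_\theta(c_{1:T} \mid x, \tilde{c})
  = \prod_{t=1}^{T} p_\theta\!\left(c_t \mid c_{<t},\, x,\, \tilde{c}\right)

% Training would then minimize the negative log-likelihood of the codes,
% in contrast to regressing a continuous signal such as a mel spectrogram:
\mathcal{L}(\theta)
  = -\sum_{t=1}^{T} \log p_\theta\!\left(c_t \mid c_{<t},\, x,\, \tilde{c}\right)
```

Under this view, synthesis amounts to sampling codes one step at a time and passing them to the codec decoder to obtain a waveform.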
Citation impact
- FWCI: 165.36
- Percentile: 100%
- References: 107
Authors
- 13
Topics & keywords
- Computer science
- Speech recognition
- Artificial intelligence
- Codec
- Zero (linguistics)
- Speech processing
- Natural language processing
- Codec2
- Quality Education