Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chen, Sanyuan; Wang, Chengyi; Wu, Yu; Zhang, Ziqiang; Zhou, Long; Liu, Shujie; Chen, Zhuo; Liu, Tie‐Yan; Wang, Huaming; Li, Jinyu; He, Lei; Zhao, Sheng; Wei, Furu

doi:10.1109/taslpro.2025.3530270

Article · IEEE Transactions on Audio Speech and Language Processing · Jan 1, 2025 · Closed access

Harbin Institute of Technology · Nankai University · +2 more institutions

Indexed in Crossref

Abstract

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 50k hours of English speech, which is hundreds of times larger than the data used by existing systems. VALL-E exhibits emergent in-context learning capability and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as a prompt. Experimental results show that VALL-E…
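
The conditional language modeling framing in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `generate_codes`, `toy_model`, the five-entry vocabulary, and the end-of-sequence id are all hypothetical, and a real system would replace `toy_model` with a trained neural network and decode the sampled codes back to audio with the codec. The sketch only shows the loop structure: each codec token is sampled conditioned on the text tokens, the codes of the enrolled speaker prompt, and the codes generated so far.

```python
import random
from typing import Callable, List, Sequence


def generate_codes(
    text_tokens: Sequence[int],
    prompt_codes: Sequence[int],
    next_token_probs: Callable[[List[int]], List[float]],
    eos_id: int,
    max_len: int = 50,
    seed: int = 0,
) -> List[int]:
    """Sample codec tokens one at a time, conditioned on text and a speaker prompt."""
    rng = random.Random(seed)
    generated: List[int] = []
    for _ in range(max_len):
        # Context = text, enrolled-speaker codes, and everything generated so far.
        context = list(text_tokens) + list(prompt_codes) + generated
        probs = next_token_probs(context)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if token == eos_id:
            break
        generated.append(token)
    return generated


def toy_model(context: List[int]) -> List[float]:
    """Hypothetical stand-in for a trained network: 4 codec tokens plus EOS (id 4)."""
    # Once the context is long enough, emit EOS with certainty.
    if len(context) >= 12:
        return [0.0, 0.0, 0.0, 0.0, 1.0]
    # Otherwise put all probability on the token following the last one seen.
    last = context[-1] if context else -1
    nxt = (last + 1) % 4
    return [1.0 if i == nxt else 0.0 for i in range(4)] + [0.0]


codes = generate_codes([7, 8, 9], [0, 1, 2], toy_model, eos_id=4)
print(codes)
```

Because the toy distribution is one-hot at every step, the output here is deterministic; with a real model the sampling step is where the variation in the synthesized speech would come from.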

Citation impact

Total citations: 87
FWCI: 165.36
Percentile: 100%
References: 107

Authors: 13

Topics & keywords

Keywords

Computer science
Speech recognition
Artificial intelligence
Codec
Zero (linguistics)
Speech processing
Natural language processing
Codec2

UN Sustainable Development Goals

Quality Education
