Diffsound: Discrete Diffusion Model for Text-to-Sound Generation

Yang, Dongchao; Yu, Jianwei; Wang, Helin; Wang, Wen; Weng, Chao; Zou, Yuexian; Yu, Dong

doi:10.1109/taslp.2023.3268730

Article · IEEE/ACM Transactions on Audio, Speech, and Language Processing · Jan 1, 2023 · Closed access

Peking University · Bellevue Hospital Center

Indexed in Crossref

Abstract

Generating sound effects that people want is an important topic. However, there are limited studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a token-decoder, and a vocoder. The framework first uses the token-decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform. We found that the token-decoder significantly influences the generation performance. Thus, we…
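The data flow named in the abstract (text encoder, token-decoder, VQ-VAE, vocoder) can be illustrated with a toy sketch. This is not the authors' implementation: every function, constant, and formula below is a hypothetical stand-in chosen only to show the order of the stages and the shape of what passes between them. Real systems would use trained neural networks at each step.

```python
"""Toy data-flow sketch of the four-stage pipeline named in the abstract.

Every function and constant here is a hypothetical stand-in, not the
authors' code; only the ordering of the stages follows the abstract.
"""
import math
from typing import List

CODEBOOK_SIZE = 8  # assumed toy number of discrete tokens
N_MELS = 4         # assumed toy number of mel bins per frame
HOP = 16           # assumed number of waveform samples per mel frame

# Toy codebook: one N_MELS-dimensional vector per discrete token id.
CODEBOOK: List[List[float]] = [
    [math.sin(0.5 * (k + 1) * (m + 1)) for m in range(N_MELS)]
    for k in range(CODEBOOK_SIZE)
]


def text_encoder(prompt: str) -> List[float]:
    """Stand-in text encoder: one feature per word (scaled word length)."""
    words = prompt.split()
    if not words:
        raise ValueError("prompt must contain at least one word")
    return [len(w) / 10.0 for w in words]


def token_decoder(text_features: List[float], n_tokens: int) -> List[int]:
    """Stand-in token-decoder: maps text features to discrete token ids."""
    return [
        int(text_features[i % len(text_features)] * 100 + i) % CODEBOOK_SIZE
        for i in range(n_tokens)
    ]


def vqvae_decode(tokens: List[int]) -> List[List[float]]:
    """Stand-in VQ-VAE decoder: codebook lookup, one mel frame per token."""
    return [CODEBOOK[t] for t in tokens]


def vocoder(mel: List[List[float]]) -> List[float]:
    """Stand-in vocoder: HOP sine samples per frame, scaled by frame energy."""
    wave: List[float] = []
    for frame in mel:
        energy = sum(abs(x) for x in frame) / len(frame)
        wave.extend(energy * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wave


def text_to_sound(prompt: str, n_tokens: int = 6) -> List[float]:
    """Chain the stages: text -> features -> tokens -> mel -> waveform."""
    features = text_encoder(prompt)
    tokens = token_decoder(features, n_tokens)
    mel = vqvae_decode(tokens)
    return vocoder(mel)
```

Calling `text_to_sound("a dog barks loudly")` returns a list of `n_tokens * HOP` samples. The point of the sketch is only that the token-decoder works in the discrete token space of the VQ-VAE, and the vocoder is the sole stage that produces waveform samples.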

Citation impact

Total citations: 189
FWCI: 35.59
Percentile: 100%
References: 87

Authors: 7

Topics & keywords

Keywords

Spectrogram
Security token
Computer science
Speech recognition
Autoregressive model
Encoder
Focus (optics)
Waveform

No related works found for this paper.