Diffsound: Discrete Diffusion Model for Text-to-Sound Generation
Peking University · Bellevue Hospital Center
Abstract
Generating sound effects that people want is an important topic. However, there are limited studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a token-decoder, and a vocoder. The framework first uses the token-decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform. We found that the token-decoder significantly influences the generation performance. Thus, we…
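The data flow named in the abstract (text encoder, token-decoder, VQ-VAE, vocoder) can be pictured with a minimal toy sketch. Everything below is an assumption for illustration, not the authors' implementation: the function names, the constants, the seeded random choice of tokens, and the direct codebook-to-mel lookup are all hypothetical stand-ins for trained neural networks. In particular, the toy token-decoder is not the discrete diffusion model named in the title.

```python
import random

CODEBOOK_SIZE = 8   # hypothetical number of discrete VQ-VAE codes
MEL_BINS = 4        # hypothetical number of mel bins per frame
HOP = 2             # hypothetical waveform samples emitted per mel frame

_rng = random.Random(0)
# Toy codebook: each token id maps directly to one mel frame.
# (A real VQ-VAE decoder is a learned network, not a plain lookup.)
CODEBOOK = [[_rng.uniform(-1.0, 1.0) for _ in range(MEL_BINS)]
            for _ in range(CODEBOOK_SIZE)]


def text_encoder(prompt):
    """Stand-in text encoder: one integer feature per word."""
    return [sum(ord(ch) for ch in word) for word in prompt.lower().split()]


def token_decoder(text_features, num_tokens):
    """Stand-in token-decoder: text features -> discrete token ids.

    Uses a seeded random draw so the same prompt gives the same tokens.
    """
    local = random.Random(sum(text_features))
    return [local.randrange(CODEBOOK_SIZE) for _ in range(num_tokens)]


def vqvae_decode(tokens):
    """Map token ids to mel frames via the toy codebook."""
    return [CODEBOOK[t] for t in tokens]


def vocoder(mel):
    """Stand-in vocoder: emit HOP samples per frame (the frame mean)."""
    wave = []
    for frame in mel:
        wave.extend([sum(frame) / len(frame)] * HOP)
    return wave


def text_to_sound(prompt, num_tokens=6):
    """Run the four stages in the order the abstract describes."""
    features = text_encoder(prompt)
    tokens = token_decoder(features, num_tokens)
    mel = vqvae_decode(tokens)
    return tokens, mel, vocoder(mel)
```

Calling `text_to_sound("a dog barks")` returns the intermediate tokens, the mel frames, and the waveform samples, which makes the two-step structure visible: text to mel-spectrogram first, then mel-spectrogram to waveform.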
Citation impact
- FWCI: 35.59
- Percentile: 100%
- References: 87
Authors
- 7
Topics & keywords
- Spectrogram
- Security token
- Computer science
- Speech recognition
- Autoregressive model
- Encoder
- Focus (optics)
- Waveform