Diffsound: Discrete Diffusion Model for Text-to-Sound Generation

Peking University · Bellevue Hospital Center

Indexed in Crossref

Abstract

Generating the sound effects that people want is an important problem, yet sound generation has received limited study. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a token-decoder, and a vocoder. The framework first uses the token-decoder to convert the text features extracted by the text encoder into a mel-spectrogram with the help of the VQ-VAE; the vocoder then transforms the generated mel-spectrogram into a waveform. We found that the token-decoder significantly influences the generation performance. Thus, we…
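The data flow described in the abstract (text encoder, token-decoder, VQ-VAE, vocoder) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name and signature below is a hypothetical stand-in, the four learned components are passed in as plain callables, and the codebook is a toy list of vectors. The only concrete logic is the standard vector-quantization step, which maps each frame to the index of its nearest codebook entry and back.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def nearest_code(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `frame`
    under squared Euclidean distance (ties go to the lowest index)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(frame, codebook[k])),
    )


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent frame to a discrete token (a codebook index)."""
    return [nearest_code(f, codebook) for f in frames]


def lookup(tokens: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[Vector]:
    """Map discrete tokens back to their codebook vectors."""
    return [list(codebook[t]) for t in tokens]


def text_to_sound(
    prompt: str,
    text_encoder: Callable[[str], Vector],                  # hypothetical: text -> features
    token_decoder: Callable[[Vector], List[int]],           # hypothetical: features -> tokens
    codebook: Sequence[Sequence[float]],                    # VQ-VAE codebook (toy values here)
    vq_decoder: Callable[[List[Vector]], List[Vector]],     # hypothetical: latents -> mel frames
    vocoder: Callable[[List[Vector]], List[float]],         # hypothetical: mel frames -> samples
) -> List[float]:
    """Chain the four stages named in the abstract."""
    text_features = text_encoder(prompt)        # 1. encode the text prompt
    tokens = token_decoder(text_features)       # 2. predict discrete mel tokens
    latents = lookup(tokens, codebook)          # 3a. tokens -> codebook vectors
    mel = vq_decoder(latents)                   # 3b. VQ-VAE decoder -> mel-spectrogram
    return vocoder(mel)                         # 4. mel-spectrogram -> waveform


# Toy stand-ins, only to show that the pieces connect.
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_tokens = quantize([[0.1, -0.1], [0.9, 1.2]], toy_codebook)
toy_wave = text_to_sound(
    "a dog barks",
    text_encoder=lambda p: [float(len(p))],
    token_decoder=lambda feats: [1, 0],
    codebook=toy_codebook,
    vq_decoder=lambda latents: latents,
    vocoder=lambda mel: [x for frame in mel for x in frame],
)
```

Passing the components as callables keeps the sketch independent of any particular model or library; in a real system each would be a trained network.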

Citation impact

Total citations: 189
FWCI: 35.59
Percentile: 100%
References: 87

Authors

7 authors

Topics & keywords

Keywords
  • Spectrogram
  • Security token
  • Computer science
  • Speech recognition
  • Autoregressive model
  • Encoder
  • Focus (optics)
  • Waveform