AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining
University of Surrey · Chinese University of Hong Kong
Abstract
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a holistic framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework utilizes a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate…
Citation impact
- FWCI: 46.42
- Percentile: 100%
- References: 103
Authors
- 10
Topics & keywords
- Computer science
- Psychology
- Quality Education