AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining

University of Surrey · Chinese University of Hong Kong

Indexed in Crossref

Abstract

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a holistic framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework utilizes a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate…

Citation impact

Total citations: 145
FWCI: 46.42
Percentile: 100%
References: 103

Authors

10

Topics & keywords

Keywords
  • Computer science
  • Psychology
UN Sustainable Development Goals
  • Quality Education

Funding