AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining

Liu, Haohe; Yuan, Yi; Liu, Xubo; Mei, Xinhao; Kong, Qiuqiang; Tian, Qiao; Wang, Yu-Ping; Wang, Wenwu; Wang, Yuxuan; Plumbley, Mark D.

doi:10.1109/taslp.2024.3399607

Article · IEEE/ACM Transactions on Audio, Speech, and Language Processing · Jan 1, 2024 · Closed access

Haohe Liu · Yi Yuan · Xubo Liu · Xinhao Mei · Qiuqiang Kong

University of Surrey · Chinese University of Hong Kong

Indexed in Crossref

Abstract

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a holistic framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework utilizes a general representation of audio, called “language of audio” (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate…

Citation impact

Total citations: 145

FWCI: 46.42
Percentile: 100%
References: 103

Authors: 10

Topics & keywords

Topics

Keywords

Computer science
Psychology

UN Sustainable Development Goals

Quality Education

Funding

Engineering and Physical Sciences Research Council
Award: EP/T019751/1