AudioLM: A Language Modeling Approach to Audio Generation

Borsos, Zalán; Marinier, Raphaël; Vincent, Damien; Kharitonov, Eugene; Pietquin, Olivier; Sharifi, Matt; Roblek, Dominik; Teboul, Olivier; Grangier, David; Tagliasacchi, Marco; Zeghidour, Neil

doi:10.1109/taslp.2023.3288409

Article · IEEE/ACM Transactions on Audio, Speech, and Language Processing · Jan 1, 2023 · Closed access

Google (Switzerland) · Google (United States)

Indexed in Crossref

Abstract

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM…
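The hybrid tokenization idea in the abstract can be illustrated with a toy sketch. This is not the AudioLM implementation. The abstract describes a masked language model and a neural audio codec as the two tokenizers, whereas the stand-ins below are simple scalar quantizers, and the bigram counter is only a placeholder for a learned language model. All function names, the codebooks, the hop size, and the choice to place coarse tokens before fine tokens in one sequence are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import random


def quantize(values, codebook):
    """Map each value to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: abs(v - codebook[k]))
        for v in values
    ]


def coarse_tokens(signal, hop, codebook):
    """Hypothetical stand-in for long-term-structure tokens:
    one token per `hop` samples, taken from the frame mean."""
    frames = [signal[i:i + hop] for i in range(0, len(signal), hop)]
    means = [sum(f) / len(f) for f in frames]
    return quantize(means, codebook)


def fine_tokens(signal, codebook):
    """Hypothetical stand-in for codec tokens: one token per sample."""
    return quantize(signal, codebook)


def hybrid_sequence(signal, hop, coarse_cb, fine_cb):
    """Concatenate coarse then fine tokens (ordering is an assumption).
    Fine tokens are offset so the two vocabularies do not collide."""
    coarse = coarse_tokens(signal, hop, coarse_cb)
    fine = fine_tokens(signal, fine_cb)
    offset = len(coarse_cb)
    return coarse + [offset + t for t in fine]


def fit_bigram(seq):
    """Count next-token frequencies; a placeholder for a learned LM."""
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts


def sample(counts, start, n, seed=0):
    """Autoregressively sample up to n tokens after `start`."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        nxt = counts.get(out[-1])
        if not nxt:
            break  # no observed continuation for this token
        tokens, weights = zip(*nxt.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return out


signal = [0.0, 0.1, 0.9, 1.0, 0.0, 0.1, 0.9, 1.0]
coarse_cb = [0.0, 0.5, 1.0]
fine_cb = [0.0, 0.25, 0.5, 0.75, 1.0]

seq = hybrid_sequence(signal, hop=4, coarse_cb=coarse_cb, fine_cb=fine_cb)
model = fit_bigram(seq)
generated = sample(model, start=seq[0], n=5)
```

In this sketch the coarse tokens change slowly and summarize each frame, while the fine tokens follow the waveform sample by sample, which mirrors the trade-off between long-term structure and reconstruction quality that the abstract attributes to the two tokenizer families.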

Citation impact

Total citations: 358

FWCI: 58.77
Percentile: 100%
References: 92

Authors: 11

Topics & keywords

Keywords

Computer science
Speech recognition
Natural language processing
Audio mining
Artificial intelligence
Acoustic model
Speech processing
