AudioLM: A Language Modeling Approach to Audio Generation

Google (Switzerland) · Google (United States)

Indexed in Crossref

Abstract

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM…
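The tokenization idea in the abstract can be illustrated with a toy sketch: feature vectors are discretized into coarse tokens, and these are combined with codec-style tokens into one sequence that a language model could consume. Everything here is an assumption made for illustration and is not taken from the paper: the nearest-centroid quantizer, the ID offset that keeps the two vocabularies apart, the coarse-then-fine ordering, and the function names `discretize` and `hybrid_sequence`.

```python
def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to vec (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(vec, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def discretize(frames, centroids):
    """Map each feature frame to a discrete token (hypothetical stand-in for
    the 'discretized activations' mentioned in the abstract)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def hybrid_sequence(coarse_tokens, codec_tokens, coarse_vocab_size):
    """Concatenate coarse tokens with codec tokens into one token sequence.

    Codec token IDs are shifted by coarse_vocab_size so the two vocabularies
    do not collide. The ordering (coarse first) is an assumption.
    """
    return list(coarse_tokens) + [t + coarse_vocab_size for t in codec_tokens]


# Toy data: two centroids in 2-D, three feature frames, three codec codes.
centroids = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]
coarse = discretize(frames, centroids)           # [0, 1, 0]
codec = [3, 0, 2]                                # made-up codec codes
sequence = hybrid_sequence(coarse, codec, len(centroids))
print(sequence)                                  # [0, 1, 0, 5, 2, 4]
```

A sequence like this is what makes the language-modeling framing possible: once audio is a list of integers, next-token prediction applies as it would to text.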

Citation impact

  • Total citations: 358
  • FWCI (Field-Weighted Citation Impact): 58.77
  • Percentile: 100%
  • References: 92

Authors

11 authors

Topics & keywords

Keywords
  • Computer science
  • Speech recognition
  • Natural language processing
  • Audio mining
  • Artificial intelligence
  • Acoustic model
  • Speech processing