AudioLM: A Language Modeling Approach to Audio Generation
Google (Switzerland) · Google (United States)
Abstract
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM…
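The two ideas in the abstract, mapping audio to discrete tokens and then modeling the token sequence, can be illustrated with a toy sketch. This is not the AudioLM implementation: the 2-D "activations", the three-entry codebook, the nearest-centroid quantizer, and the bigram counter (standing in for a real autoregressive language model) are all assumptions chosen for brevity, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def quantize(frames, centroids):
    """Map each feature vector to the index of its nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(f, centroids[k]))
        for f in frames
    ]


def fit_bigram(tokens):
    """Count next-token frequencies (a stand-in for an autoregressive LM)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent successor of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


# Toy 2-D "activations" and a hypothetical 3-entry codebook (assumed values).
centroids = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (2.1, 0.1),
          (0.0, 0.2), (1.1, 0.8), (1.9, -0.2)]

tokens = quantize(frames, centroids)  # continuous frames -> discrete token ids
model = fit_bigram(tokens)            # "language model" over the token ids
```

In this sketch, generation would amount to repeatedly calling `predict_next` on the last token; the hybrid scheme described in the abstract would additionally need a second token stream from a neural audio codec to turn tokens back into a waveform, which is not shown here.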
Citation impact
- FWCI: 58.77
- Percentile: 100%
- References: 92
Authors
- 11
Topics & keywords
- Computer science
- Speech recognition
- Natural language processing
- Audio mining
- Artificial intelligence
- Acoustic model
- Speech processing