Generating Long Sequences with Sparse Transformers
Indexed in: arXiv, DataCite
Abstract
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of…
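To make the $O(n \sqrt{n})$ claim concrete, here is a minimal sketch that counts attention connections under one possible sparse pattern. The abstract does not spell out the factorization, so the strided layout is an assumption. Each position attends to a short local window plus every `stride`-th earlier position, with `stride` set to roughly $\sqrt{n}$. The function names are hypothetical and chosen only for illustration.

```python
import math


def strided_attention_indices(i, stride):
    """Positions j <= i that position i attends to (assumed strided pattern)."""
    # Local part: the most recent `stride` positions, including i itself.
    local = set(range(max(0, i - stride + 1), i + 1))
    # Strided part: every earlier position j with (i - j) % stride == 0.
    strided = set(range(i % stride, i + 1, stride))
    return sorted(local | strided)


def sparse_connections(n, stride=None):
    """Total number of (query, key) pairs under the assumed sparse pattern."""
    if stride is None:
        stride = max(1, math.isqrt(n))  # stride close to sqrt(n)
    return sum(len(strided_attention_indices(i, stride)) for i in range(n))


def dense_connections(n):
    """Total (query, key) pairs for full causal attention: n * (n + 1) / 2."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (16, 1024, 16384):
        print(n, sparse_connections(n), dense_connections(n))
```

With `stride` near $\sqrt{n}$, each row holds at most about $2\sqrt{n}$ entries, so the total grows as $n \sqrt{n}$ rather than $n^2$. The sketch only counts connections; it does not implement the attention kernels or the recomputation scheme the abstract mentions.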
Citation impact
- Total citations: 489
- References: 25
Authors: 4
Topics & keywords
Keywords
- Initialization
- Quadratic growth
- Computer science
- Byte
- Transformer
- Sequence (biology)
- Sparse matrix
- Architecture