Preprint · arXiv (Cornell University) · Apr 23, 2019 · Green OA

Generating Long Sequences with Sparse Transformers

Indexed in: arXiv, DataCite

Abstract

Transformers are powerful sequence models, but require time and memory that grow quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of…
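
To make the $O(n \sqrt{n})$ claim concrete, the sketch below builds one possible factorized causal attention pattern and counts how many query-key pairs it keeps. The abstract does not spell out the pattern, so the choice here (a local window plus every `stride`-th earlier position, with `stride` near $\sqrt{n}$) is an illustrative assumption, and the function names `strided_sparse_pairs`, `count_pairs`, and `dense_causal_pairs` are hypothetical. It is a counting exercise, not the paper's kernels.

```python
import math


def strided_sparse_pairs(n, stride=None):
    """For each query position i, return the set of key positions j <= i it may attend to.

    Assumed pattern (illustrative only): the most recent `stride` positions,
    plus every earlier position j with j congruent to i modulo `stride`.
    """
    if stride is None:
        stride = max(1, math.isqrt(n))  # assumption: stride close to sqrt(n)
    allowed = []
    for i in range(n):
        local = set(range(max(0, i - stride + 1), i + 1))  # recent window, includes i
        strided = set(range(i % stride, i + 1, stride))    # j = i, i - stride, i - 2*stride, ...
        allowed.append(local | strided)
    return allowed


def count_pairs(allowed):
    """Total number of (query, key) pairs kept by the pattern."""
    return sum(len(keys) for keys in allowed)


def dense_causal_pairs(n):
    """Number of (query, key) pairs in full causal attention: n(n+1)/2."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        sparse = count_pairs(strided_sparse_pairs(n))
        print(n, sparse, dense_causal_pairs(n))
```

Under this assumed pattern each row keeps at most `stride + n / stride` entries, which is about $2\sqrt{n}$ when `stride` is near $\sqrt{n}$, so the total grows like $n \sqrt{n}$ rather than $n^2$. Any earlier position remains reachable in two hops: a strided step to a position in the right residue class, then a local step.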

Citation impact

Total citations: 489
References: 25

Authors

4

Topics & keywords

Keywords
  • Initialization
  • Quadratic growth
  • Computer science
  • Byte
  • Transformer
  • Sequence (biology)
  • Sparse matrix
  • Architecture