Generating Long Sequences with Sparse Transformers

Child, Rewon; Gray, Scott; Radford, Alec; Sutskever, Ilya

doi:10.48550/arxiv.1904.10509

Preprint · arXiv (Cornell University) · Apr 23, 2019 · Green OA

Indexed in: arXiv, DataCite

Abstract

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of…
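
The abstract does not spell out the factorization itself, so the following is only a minimal sketch of how a sparse attention pattern can bring the number of attended positions down to roughly $n \sqrt{n}$. It assumes a strided pattern in which each position attends to its most recent `stride` predecessors and to every `stride`-th earlier position, with `stride` close to $\sqrt{n}$. The names `strided_mask` and `count_connections` are hypothetical helpers introduced for illustration, not the authors' code or kernels.

```python
import math


def strided_mask(n, stride=None):
    """Build an n x n causal boolean mask for an assumed strided sparse pattern.

    mask[i][j] is True when position i may attend to position j.
    Assumption: position i sees the previous `stride` positions (local part)
    plus every earlier position whose offset from i is a multiple of `stride`.
    """
    if stride is None:
        stride = max(1, math.isqrt(n))  # stride near sqrt(n)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: never look at future positions
            local = (i - j) < stride
            strided = (i - j) % stride == 0
            mask[i][j] = local or strided
    return mask


def count_connections(mask):
    """Count how many (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n = 256
    stride = math.isqrt(n)
    sparse = count_connections(strided_mask(n, stride))
    dense = n * (n + 1) // 2  # full causal attention
    print("sparse connections:", sparse)
    print("dense connections: ", dense)
    print("bound 2 * n * stride:", 2 * n * stride)
```

The sketch builds the full mask only to make the pattern easy to inspect; doing so is itself quadratic. An efficient implementation would compute just the allowed entries, which is the role the abstract assigns to the fast attention kernels.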

Citation impact

Total citations: 489

FWCI: —
Percentile: —
References: 25

Authors: 4

Topics & keywords

Keywords

Initialization
Quadratic growth
Computer science
Byte
Transformer
Sequence (biology)
Sparse matrix
Architecture
