Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
Carnegie Mellon University · Google (United States)
Abstract
Transformers have the potential to learn longer-term dependencies but are limited by a fixed-length context in language modeling. We propose Transformer-XL, a novel neural architecture that enables learning dependencies beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only captures longer-term dependencies but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependencies that are 80% longer than those of RNNs and 450% longer than those of vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than…
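The segment-level recurrence mentioned in the abstract can be pictured as a cache that carries state from one fixed-length segment to the next. The following is a minimal sketch of that bookkeeping only, not the authors' implementation: the function name `segments_with_memory` and the parameters `seg_len` and `mem_len` are hypothetical, and raw token ids stand in for the per-layer hidden states a real model would cache and treat as constants (no gradient) when processing the next segment.

```python
from typing import Iterator, List, Sequence, Tuple


def segments_with_memory(
    tokens: Sequence[int], seg_len: int, mem_len: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (memory, segment) pairs for consecutive fixed-length segments.

    `memory` is an illustrative stand-in for cached states of the most
    recent `mem_len` positions (assumption: token ids replace hidden states).
    """
    memory: List[int] = []
    for start in range(0, len(tokens), seg_len):
        segment = list(tokens[start:start + seg_len])
        # The current segment may attend to `memory` plus itself.
        yield memory, segment
        # Keep only the most recent `mem_len` positions for the next segment.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []


# Ten tokens, segments of four, memory of four positions.
pairs = list(segments_with_memory(list(range(10)), seg_len=4, mem_len=4))
for memory, segment in pairs:
    print("memory:", memory, "segment:", segment)
```

With `mem_len=0` every segment sees only itself, which corresponds to the fixed-length-context setting the abstract describes as a limitation; with a non-zero `mem_len`, each segment also sees positions from the segment before it.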
Citation impact
- FWCI: 253.60
- Percentile: 100%
- References: 80
Authors: 6
Topics & keywords
- Perplexity
- Computer science
- Language model
- Transformer
- Treebank
- Artificial intelligence
- Hyperparameter
- Natural language processing
- Quality Education