Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Indexed in: arXiv, DataCite
Abstract
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this…
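The abstract does not spell out how the intra-layer split works, so the following is only a minimal sketch of one way such a split can be arranged for a two-layer MLP block. It assumes the first weight matrix is divided by columns and the second by rows, so each simulated worker computes an independent partial result and a single sum (standing in for an all-reduce) recovers the full output. The helper names (`mlp_reference`, `mlp_sharded`, `split_columns`, `split_rows`) are hypothetical, and plain Python lists replace PyTorch tensors and real communication calls.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Apply the exact (erf-based) GeLU to every element of a matrix."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal blocks of columns (hypothetical helper)."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal blocks of rows (hypothetical helper)."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def mlp_reference(x, a, b):
    """Unsharded MLP block: GeLU(x @ a) @ b."""
    return matmul(gelu(matmul(x, a)), b)


def mlp_sharded(x, a, b, workers=2):
    """Same block with `a` split by columns and `b` split by rows.

    Assumption: each loop iteration plays the role of one device. Because GeLU
    is elementwise, each worker can apply it to its own column block without
    communicating; only the final partial outputs need to be summed.
    """
    a_shards = split_columns(a, workers)
    b_shards = split_rows(b, workers)
    partials = [
        matmul(gelu(matmul(x, a_i)), b_i)
        for a_i, b_i in zip(a_shards, b_shards)
    ]
    # Elementwise sum across workers; stands in for an all-reduce.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


if __name__ == "__main__":
    x = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
    a = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8], [-0.9, 1.0, 1.1, -1.2]]
    b = [[1.0, -1.0], [0.5, 0.5], [-0.25, 2.0], [0.0, 1.5]]
    print(mlp_reference(x, a, b))
    print(mlp_sharded(x, a, b, workers=2))
```

In this arrangement the two printed matrices should agree up to floating-point rounding, which is the property that lets the only cross-worker step in the forward pass be the final sum.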
Citation impact
- Total citations: 825
- FWCI: —
- Percentile: —
- References: 46
Authors: 6
Topics & keywords
Keywords
- Parallelism (grammar)
- Training (meteorology)
- Computer science
- Parallel computing
- Data parallelism
- Geography
- Meteorology
UN Sustainable Development Goals
- Quality Education