Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Indexed in: arXiv, DataCite
Abstract
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used…
Citation impact
- Total citations: 299
- FWCI: —
- Percentile: —
- References: 0
Authors: 20
Topics & keywords
Topics
Keywords
- Computer science
- Transformer
- Generative grammar
- Language model
- Artificial intelligence
- Natural language processing
- Natural language generation
- Process (computing)
UN Sustainable Development Goals
- Industry, innovation and infrastructure