Preprint · arXiv (Cornell University) · Jan 28, 2022 · Green OA

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Indexed in: arXiv, DataCite

Abstract

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used…

Citation impact

299 total citations
References: 0

Authors

20 authors

Topics & keywords

Keywords
  • Computer science
  • Transformer
  • Generative grammar
  • Language model
  • Artificial intelligence
  • Natural language processing
  • Natural language generation
  • Process (computing)
UN Sustainable Development Goals
  • Industry, innovation and infrastructure