Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay Anand; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Xia, Song; Shoeybi, Mohammad; He, Yuxiong; Houston, Michael E.; Tiwary, Saurabh; Catanzaro, Bryan

doi:10.48550/arxiv.2201.11990

Preprint; arXiv (Cornell University); Jan 28, 2022; Green OA

Indexed in: arXiv, DataCite

Abstract

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used…

Citation impact

Total citations: 299

FWCI: —
Percentile: —
References: 0

Authors: 20

Topics & keywords

Keywords

Computer science
Transformer
Generative grammar
Language model
Artificial intelligence
Natural language processing
Natural language generation
Process (computing)

UN Sustainable Development Goals

Industry, innovation and infrastructure
