Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi, Mohammad; Patwary, Mostofa; Puri, Raul; LeGresley, Patrick; Casper, Jared; Catanzaro, Bryan

doi:10.48550/arxiv.1909.08053

Preprint, arXiv (Cornell University), Sep 17, 2019, Green OA

Nvidia (United Kingdom)

Indexed in: arXiv, DataCite

Abstract

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this…
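The idea of intra-layer model parallelism mentioned in the abstract can be illustrated with a minimal sketch. It assumes a two-layer MLP block whose first weight matrix is split by columns and whose second is split by rows, so that each simulated worker computes a partial output and a single sum recovers the full result. Plain Python lists stand in for tensors, and every function name here (`split_columns`, `split_rows`, `all_reduce_sum`, `mlp_model_parallel`) is hypothetical; `all_reduce_sum` only imitates the role of an all-reduce communication operation and is not a PyTorch call.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(x):
    """Exact GeLU, applied to a single scalar."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def apply_gelu(m):
    """Apply GeLU elementwise to a matrix."""
    return [[gelu(v) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks (one per simulated worker)."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks (one per simulated worker)."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Hypothetical stand-in for an all-reduce: elementwise sum of partial outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, a, b):
    """Unsharded computation: GeLU(x @ a) @ b."""
    return matmul(apply_gelu(matmul(x, a)), b)


def mlp_model_parallel(x, a, b, world_size):
    """Sharded computation: each worker holds one column block of `a` and the
    matching row block of `b`; partial outputs are summed at the end."""
    a_shards = split_columns(a, world_size)
    b_shards = split_rows(b, world_size)
    partials = [
        matmul(apply_gelu(matmul(x, a_i)), b_i)
        for a_i, b_i in zip(a_shards, b_shards)
    ]
    return all_reduce_sum(partials)


x = [[1.0, -2.0], [0.5, 3.0]]
a = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]
b = [[1.0, 0.0], [0.5, -1.0], [-0.3, 0.2], [0.7, 0.9]]

full = mlp_reference(x, a, b)
sharded = mlp_model_parallel(x, a, b, world_size=2)
```

Because GeLU acts elementwise, applying it to each column block separately gives the same hidden activations as applying it to the full matrix, which is why only one summation is needed in this sketch.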

Citation impact

Total citations: 825

FWCI: —
Percentile: —
References: 46

Authors: 6

Topics & keywords

Keywords

Parallelism (grammar)
Training (meteorology)
Computer science
Parallel computing
Data parallelism
Geography
Meteorology

UN Sustainable Development Goals

Quality Education