Preprint · arXiv (Cornell University) · Sep 17, 2019 · Green OA

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Nvidia (United Kingdom)

Indexed in: arXiv, DataCite

Abstract

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this…
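The abstract as shown here does not spell out how a layer is partitioned. As a rough illustration of what "intra-layer model parallel" can mean, the sketch below assumes the partitioning of the transformer's two-layer MLP block described in the full paper: the first weight matrix is split by columns, the second by rows, and the per-worker partial outputs are summed by a single all-reduce. It is plain Python rather than PyTorch, the workers are simulated in one process, and every function name is hypothetical rather than taken from the authors' code.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Elementwise GeLU using the common tanh approximation."""
    c = math.sqrt(2.0 / math.pi)
    return [[0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in row]
            for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of per-worker outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, a, b):
    """Unpartitioned MLP block: GeLU(x @ a) @ b."""
    return matmul(gelu(matmul(x, a)), b)


def mlp_model_parallel(x, a, b, workers):
    """Simulated intra-layer parallel MLP (assumed column/row split)."""
    a_shards = split_columns(a, workers)  # worker i holds a column block of a
    b_shards = split_rows(b, workers)     # and the matching row block of b
    # GeLU is elementwise, so each worker can apply it to its own column
    # block without communicating with the others.
    partials = [matmul(gelu(matmul(x, a_i)), b_i)
                for a_i, b_i in zip(a_shards, b_shards)]
    # One communication step combines the partial outputs.
    return all_reduce_sum(partials)


random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
a = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
b = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

expected = mlp_reference(x, a, b)
actual = mlp_model_parallel(x, a, b, workers=2)
max_diff = max(abs(e - g)
               for e_row, g_row in zip(expected, actual)
               for e, g in zip(e_row, g_row))
```

In this arrangement each simulated worker stores only half of each weight matrix, and the partitioned result matches the unpartitioned one up to floating-point rounding. In a real multi-GPU setup the `all_reduce_sum` stand-in would be replaced by a collective communication call.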

Citation impact

Total citations: 825
References: 46

Authors

6

Topics & keywords

Keywords
  • Parallelism (grammar)
  • Training (meteorology)
  • Computer science
  • Parallel computing
  • Data parallelism
  • Geography
  • Meteorology
UN Sustainable Development Goals
  • Quality Education