Deep learning with COTS HPC systems
Stanford University · Nvidia (United States)
Abstract
Scaling up deep learning algorithms has been shown to lead to increased performance in benchmark tasks and to enable discovery of complex high-level features. Recent efforts to train extremely large networks (with over 1 billion parameters) have relied on cloud-like computing infrastructure and thousands of CPU cores. In this paper, we present technical details and results from our own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. Our system is able to train 1-billion-parameter networks on just 3 machines in a couple of days, and we show that it can scale to networks with over 11 billion parameters using…
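To give a rough sense of the scale the abstract mentions, the sketch below estimates the memory needed just to store the parameters of a 1-billion and an 11-billion-parameter network. It assumes 4-byte (float32) parameters, which the abstract does not state, and it ignores activations, gradients, and any other working memory. The function names are hypothetical and chosen for illustration only; this is not the authors' code.

```python
# Back-of-envelope parameter storage estimate.
# Assumption: each parameter is stored as a 4-byte float32 value;
# the abstract does not specify the numeric precision.
BYTES_PER_PARAM = 4


def param_memory_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the memory, in GiB, needed to hold the parameters alone."""
    return num_params * bytes_per_param / 2**30


def per_machine_gib(num_params: int, num_machines: int,
                    bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the per-machine share in GiB, assuming an even split of parameters."""
    if num_machines <= 0:
        raise ValueError("num_machines must be positive")
    return param_memory_gib(num_params, bytes_per_param) / num_machines


if __name__ == "__main__":
    for n in (1_000_000_000, 11_000_000_000):
        print(f"{n:,} parameters: {param_memory_gib(n):.2f} GiB")
    # The abstract reports 3 machines for the 1-billion-parameter case.
    share = per_machine_gib(1_000_000_000, 3)
    print(f"Even split over 3 machines: {share:.2f} GiB each")
```

Under these assumptions the parameters alone occupy a few GiB for the smaller network and tens of GiB for the larger one, which suggests why a model of that size would have to be spread across several GPU servers.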
Citation impact
- FWCI: 49.43
- Percentile: 100%
- References: 29
Authors: 6
Topics & keywords
- InfiniBand
- Computer science
- Server
- Supercomputer
- Deep learning
- Benchmark (surveying)
- Artificial neural network
- Distributed computing
- Industry, innovation and infrastructure