Article · Nov 15, 2008 · Closed access
Benchmarking GPUs to tune dense linear algebra
University of California, Berkeley
Abstract
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor’s implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking, as on vector computers, and the heterogeneity of the system by computing on both the GPU and the CPU. This study includes…
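The abstract measures the factorizations against the peak GEMM rate. One way to see why that is a natural yardstick is that a blocked LU spends most of its arithmetic in a matrix-matrix update of the trailing submatrix. The following is a minimal plain-Python sketch of that idea, not the paper's implementation: it assumes no pivoting and a diagonally dominant input, and the names `blocked_lu`, `reconstruct`, `make_matrix` and the block size `nb` are hypothetical, chosen for illustration only.

```python
def blocked_lu(a, nb):
    """Right-looking blocked LU without pivoting (illustrative assumption).

    Overwrites `a` (a list of row lists) with L below the diagonal (unit
    diagonal implied) and U on and above it. Returns the fraction of the
    counted flops that fall in the GEMM-shaped trailing update.
    """
    n = len(a)
    gemm_flops = 0
    total_flops = 0
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Panel factorization: unblocked LU on columns k..e-1.
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                total_flops += 1
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
                    total_flops += 2
        # 2. Triangular solve: rows k..e-1 of U to the right of the panel.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
                    total_flops += 2
        # 3. Trailing update, the GEMM-shaped step: A22 -= L21 * U12.
        for i in range(e, n):
            for c in range(e, n):
                s = 0.0
                for j in range(k, e):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
        flops = 2 * (n - e) * (n - e) * (e - k)
        gemm_flops += flops
        total_flops += flops
    return gemm_flops / total_flops


def reconstruct(lu):
    """Multiply the packed L and U factors back together."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for c in range(n):
            s = 0.0
            for j in range(min(i, c) + 1):
                l_ij = 1.0 if i == j else lu[i][j]  # unit diagonal of L
                s += l_ij * lu[j][c]
            out[i][c] = s
    return out


def make_matrix(n):
    """Diagonally dominant test matrix, so skipping pivoting is safe."""
    return [[1.0 / (1 + abs(i - j)) + (n if i == j else 0.0)
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    a = make_matrix(64)
    share = blocked_lu(a, 8)
    print(f"share of flops in the trailing update: {share:.2f}")
```

In this sketch the share of flops in step 3 grows as the matrix gets large relative to `nb`, which is why the speed of the matrix-matrix multiply largely bounds the speed of the whole factorization.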
Citation impact
- Total citations: 726
- FWCI: 97.39
- Percentile: 100%
- References: 18
Authors: 2
Topics & keywords
Topics
Keywords
- Parallel computing
- Computer science
- Linear algebra
- Benchmarking
- Cholesky decomposition
- CUDA
- Multi-core processor
- Kernel (algebra)