Improving performance of matrix multiplication and FFT on GPU | |
Cui, Xiang ; Chen, Yifeng ; Mei, Hong | |
2009 | |
Abstract (English) | In this paper we discuss our experiences in improving the performance of two key algorithms: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT, both using CUDA. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. A peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280 for the former, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles are discussed for the design and implementation of many-core algorithms. © 2009 IEEE.; EI; 0 |
Language | English |
DOI | 10.1109/ICPADS.2009.8 |
Content type | Other |
Source URL | [http://ir.pku.edu.cn/handle/20.500.11897/153569] |
Collection | School of Electronics Engineering and Computer Science (信息科学技术学院) |
Recommended citation (GB/T 7714) | Cui, Xiang, Chen, Yifeng, Mei, Hong. Improving performance of matrix multiplication and FFT on GPU. 2009-01-01. |
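
The abstract's contrast between a computation-intensive SGEMM and a bandwidth-intensive FFT can be illustrated with a rough arithmetic-intensity estimate. The sketch below is not taken from the paper: it assumes the conventional operation counts (2n^3 flops for an n x n SGEMM, 5N log2 N flops for a length-N complex FFT) and an idealized memory model in which each operand crosses the memory bus only once. The function names are hypothetical.

```python
import math

BYTES_PER_FLOAT = 4  # single precision


def sgemm_intensity(n: int) -> float:
    """Flops per byte for C = alpha*A*B + beta*C on n x n matrices (assumed ideal reuse)."""
    flops = 2 * n**3                            # one multiply and one add per inner-product term
    bytes_moved = 4 * n * n * BYTES_PER_FLOAT   # read A, B and C once, write C once
    return flops / bytes_moved


def fft_intensity(n: int) -> float:
    """Flops per byte for a length-n complex FFT, n a power of two (assumed single pass)."""
    flops = 5 * n * math.log2(n)                # conventional radix-2 operation count
    bytes_moved = 2 * n * 2 * BYTES_PER_FLOAT   # read and write n complex values
    return flops / bytes_moved


if __name__ == "__main__":
    print(sgemm_intensity(1024))   # n/8 flops per byte under these assumptions
    print(fft_intensity(2**20))    # 5*log2(N)/16 flops per byte under these assumptions
```

Under these assumptions the SGEMM intensity grows linearly with n, while the FFT intensity grows only with log2 N, which is consistent with the abstract's description of the FFT as limited by memory bandwidth rather than arithmetic throughput.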