I would like to report our CGEMM routine for GT200 GPUs. It extends our earlier SGEMM work, which can be downloaded from
http://forums.nvidia.com/index.php?showtopic=159033 .
Figure 1 shows the performance (Gflop/s) of our method on the Tesla C1060, GTX285 and GTX295. The baseline is Volkov's code on the Tesla C1060 (black dashed line).
Our method reaches 448 Gflop/s on the Tesla C1060, whereas CUBLAS (CUDA 2.3) reaches only 277.7 Gflop/s, a speedup of roughly 1.6x.
Figure 1: comparison between our method and Volkov's code
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/figure1.JPG[/img]
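For readers who want to reproduce Gflop/s numbers like the ones above, here is a minimal sketch of the usual conversion from problem size and run time. It assumes the conventional count of 8*m*n*k real floating-point operations for a complex matrix multiply (4 multiplies and 4 adds per complex multiply-add); the post itself does not state which count was used, and the function name cgemm_gflops is hypothetical, not part of the downloadable source.

```c
#include <assert.h>

/* Hypothetical helper, not taken from lsc_cgemm_v2.zip.
 * Assumption: a CGEMM on an m x k matrix times a k x n matrix is counted
 * as 8*m*n*k real flops (each complex multiply-add = 4 mul + 4 add). */
static double cgemm_gflops(double m, double n, double k, double seconds)
{
    assert(seconds > 0.0);               /* a timing must be positive */
    double flops = 8.0 * m * n * k;      /* total real operations */
    return flops / seconds * 1e-9;       /* flop/s scaled to Gflop/s */
}
```

For example, cgemm_gflops(1000, 1000, 1000, 1.0) evaluates to 8 Gflop/s, since 8 * 10^9 operations complete in one second. In practice the elapsed time would come from a GPU timer around the kernel launch, averaged over several runs.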
Technical report: http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/HandTunedCgemm_2010_v1.pdf
Source code: http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/lsc_cgemm_v2.zip