I would like to report our CGEMM routine for GT200 GPUs. This work extends our earlier SGEMM routine, which can be downloaded from

http://forums.nvidia.com/index.php?showtopic=159033 .

Figure 1 shows the performance (Gflop/s) of our method on the Tesla C1060, GTX 285 and GTX 295. The baseline is Volkov's code on the Tesla C1060 (black dashed line).

Our method reaches 448 Gflop/s on the Tesla C1060, whereas CUBLAS (CUDA 2.3) only reaches 277.7 Gflop/s, so ours is about 1.6 times faster.

Figure 1: Comparison between our method and Volkov's code

[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/figure1.JPG[/img]
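
For readers who want to compare their own timings against the Gflop/s numbers above, here is a minimal sketch of how a CGEMM rate is usually computed. It assumes the common convention of 8 real flops per complex multiply-add (4 multiplies and 4 adds), giving 8*m*n*k flops for an m-by-k times k-by-n product. The post does not state which flop count was used, so treat this convention as an assumption. The function name cgemm_gflops is made up for illustration.

```c
/* Hypothetical helper, not part of lsc_cgemm or CUBLAS.
 * Assumption: one complex multiply-add counts as 8 real flops
 * (4 multiplies + 4 adds), so C = A*B with A (m x k) and B (k x n)
 * costs 8*m*n*k flops. */
double cgemm_gflops(long m, long n, long k, double seconds)
{
    /* Convert to double before multiplying to avoid integer overflow. */
    double flops = 8.0 * (double)m * (double)n * (double)k;

    /* 1 Gflop/s = 1e9 floating-point operations per second. */
    return flops / seconds / 1.0e9;
}
```

Call it as cgemm_gflops(n, n, n, t) for a square problem of size n, where t is the measured kernel time in seconds.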

Source code can be downloaded from
http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/lsc_cgemm_v2.zip

Technical report: http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/HandTunedCgemm_2010_v1.pdf