I would like to report our CGEMM routine on GT200 GPUs. This work extends our earlier SGEMM routine, which can be downloaded from

Figure 1 shows the performance (Gflop/s) of our method on the Tesla C1060, GTX 285 and GTX 295. The baseline is Volkov's code on the Tesla C1060 (black dashed line).

Our method reaches 448 Gflop/s on the Tesla C1060, whereas CUBLAS (CUDA 2.3) only reaches 277.7 Gflop/s.
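
For readers who want to compare their own timings against these numbers, here is a minimal sketch of how a Gflop/s figure for CGEMM is usually derived. It assumes the common convention of 8*m*n*k real floating-point operations for a complex matrix multiply (each complex multiply-add costs 4 real multiplies and 4 real adds); the technical report may count operations differently. The function name `cgemm_gflops` and its parameters are hypothetical and are not taken from the posted source code.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the posted package.
 * Converts a measured CGEMM run time into Gflop/s, assuming the
 * conventional count of 8*m*n*k real flops for C = A*B with an
 * m-by-k matrix A and a k-by-n matrix B of single-precision complex
 * numbers. Returns 0.0 when the elapsed time is not positive. */
double cgemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    if (seconds <= 0.0) {
        return 0.0;
    }
    double flops = 8.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1e9;
}
```

The elapsed time would typically be measured around the kernel call alone (for example with CUDA events), excluding host-to-device transfers, so that the result is comparable with the kernel-only figures above.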

[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/figure1.JPG[/img]

The source code can be downloaded from
http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/lsc_cgemm_v2.zip

Technical report: http://oz.nthu.edu.tw/~d947207/NVIDIA/CGEMM/HandTunedCgemm_2010_v1.pdf