From Volkov's paper,
Vasily Volkov, James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra.
In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing,
Piscataway, NJ, USA, 2008, IEEE Press
(it can be obtained in the thread http://forums.nvidia.com/index.php?showtopic=89084),
the authors show
(1) average time per instruction in cycles (a, b, c are registers) for GPUs in
Table 1
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/MAD_pic/table1.JPG[/img]
(2) average time per instruction in cycles (a, b: registers, s[i]: shared memory)
in Table 2
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/MAD_pic/table2.JPG[/img]
I focus on the latency and throughput of three instructions on the Tesla C1060:
"a = a*b", "a = a*s[i] + c", and "a = a*b + c".
From the decuda program,
"a = a*b" is translated to "MUL dest src1 src2" (a pure multiply, see DECUDA 1 below),
"a = a*s[i] + c" is translated to "MAD dest [smem] src2 src3", and
"a = a*b + c" is translated to "MAD dest src1 src2 src3".
From my experiments (shown later), I conclude:
(1) "a = a*b" has latency = 24.7 cycles and throughput = 2.5 ~ 4.1 cycles per warp
(this number is very strange; I will show the experimental data later).
(2) "a = a*s[i] + c" has latency = 34.6 cycles and throughput = 6 cycles per warp
(this number is good and matches the result in Table 2).
(3) "a = a*b + c" has latency = 31.5 cycles and throughput = ? (maybe 4 cycles per warp).
This latency of 31.5 cycles is larger than the 24 cycles in Table 1.
Has anyone verified these numbers?
The following is my experiment on the Tesla C1060.
Step 1: calibrate "a = a*b"
CODE 1 sets "a" as register A_reg and "b" as register b_reg and
computes "A_reg = A_reg * b_reg ;" 256 times.
The execution configuration is grid(1,1,1) and threads(NUM_THREADS, 1, 1),
where the macro NUM_THREADS = number of threads per block.
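The struct timing_stats is not defined in this post; a minimal definition consistent with the kernel and host code below would be (my assumption):
[code]
// assumed definition (not shown in the original post): both fields
// are written from clock() in the kernel and subtracted on the host
struct timing_stats {
    unsigned int start_time ;
    unsigned int end_time ;
};
[/code]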
CODE 1: kernel of "a = a*b"
[code]
static __global__ void MAD_latency(float * data, timing_stats * timings)
{
    int threadNum = threadIdx.x;
    unsigned int start_time = 1664;
    unsigned int end_time = 1664;
    float A_reg = data[0] ;
    float b_reg = data[1] ;
    __syncthreads();
    start_time = clock();
    // dependent chain: each multiply reads the previous result,
    // so a single thread exposes the full pipeline latency
    #pragma unroll
    for( int j = 0 ; j < 256 ; j++){
        A_reg = A_reg * b_reg ;
    }
    end_time = clock();
    __syncthreads();
    timings[threadNum].start_time = start_time;
    timings[threadNum].end_time = end_time;
    // write A_reg back so the compiler cannot optimize the loop away
    if ( 0 == threadNum ){
        timings[NUM_THREADS].start_time = (int)A_reg ;
    }
}
[/code]
and "DECUDA 1" shows decuda's result of
[code]
start_time = clock();
#pragma unroll
for( int j = 0 ; j < 256 ; j++){
    A_reg = A_reg * b_reg ;
}
end_time = clock();
__syncthreads();
[/code]
DECUDA 1: "a = a*b" --> "mul $r3 $r3 $r2"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\decuda1.jpg[/img]
After the kernel executes, we compute the average time over the 256 iterations
per thread (see CODE 2) and
report the minimum time and maximum time in Table 3.
CODE 2: compute average number of cycles per "a = a*b"
[code]
dim3 grid(1, 1, 1);
dim3 threads(NUM_THREADS, 1, 1);
MAD_latency<<< grid, threads >>>(data_gpu, timings_gpu) ;
cudaThreadSynchronize();
CUDA_SAFE_CALL(cudaMemcpy(timings_cpu, timings_gpu, (NUM_THREADS+1)*sizeof(timing_stats),
    cudaMemcpyDeviceToHost));
// scan all threads' timings for the minimum and maximum elapsed time
int min_time = timings_cpu[0].end_time - timings_cpu[0].start_time ;
int max_time = min_time ;
for( int i = 0 ; i < NUM_THREADS ; i++ ){
    int time_gpu = timings_cpu[i].end_time - timings_cpu[i].start_time ;
    if ( min_time > time_gpu ) { min_time = time_gpu ; }
    if ( max_time < time_gpu ) { max_time = time_gpu ; }
}
[/code]
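CODE 2 extracts only the total cycle count of the timed loop; the per-instruction averages reported in Table 3 then follow by dividing by the 256 loop iterations. A minimal sketch of that final step (my reconstruction; it is not shown in the original harness):
[code]
// the timed loop executes "a = a*b" 256 times per thread, so dividing
// the elapsed cycle counts by 256 gives the average time per instruction
float avg_min = (float)min_time / 256.0f ;
float avg_max = (float)max_time / 256.0f ;
[/code]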
Table 3: result of "a = a*b"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\table3.jpg[/img]
From Table 3, NUM_THREADS=1 reports latency = 24.7 cycles; this is consistent
with Volkov's result.
To compute throughput, we define
throughput = (average time of the instruction) / (number of warps)
and show the throughput in Table 4.
Table 4: throughput of "a = a*b"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\table4.jpg[/img]
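The Table 4 numbers should be reproducible from the CODE 2 measurements; a minimal sketch under the definition above (my reconstruction; it assumes the warp size is 32 and rounds partial warps up):
[code]
// throughput = (average time of the instruction) / (number of warps)
int   num_warps  = (NUM_THREADS + 31) / 32 ;       // round up to whole warps
float avg_cycles = (float)min_time / 256.0f ;      // cycles per "a = a*b"
float throughput = avg_cycles / (float)num_warps ; // cycles per warp
[/code]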
Table 4 shows throughput = 2.5 ~ 4.1 cycles per warp. This is very strange,
since we know that a 24-cycle pipeline latency can be hidden by 6 warps; see the
Gantt chart in figure 1.
Figure 1: pipeline latency
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\figure1.jpg[/img]
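The 6-warp figure is just the ratio of the two numbers involved: warps needed to hide latency = (24-cycle pipeline latency) / (4 cycles of issue per warp) = 6 warps; by the time the sixth warp has issued, the first warp's result is ready again.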
If we invoke more than 6 warps, the total time for one instruction becomes
total time of one "a=a*b" = (4 cycles) x (number of warps); see figure 2.
Figure 2: total time of one "a=a*b"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\figure2.jpg[/img]
However, when threads = 512, the throughput is 2.5 cycles per warp, which is much
smaller than 4 cycles per warp.
Remark: in the test harness of @SPWorley in the thread
http://forums.nvidia.com/index.php?showtopic=103046&pid=570370&mode=threaded&start=#entry570370,
SPWorley uses one block of 192 threads to calibrate "how many clocks it takes".
With 192 threads, I would say the throughput of "a=a*b" is 4.1 cycles per warp.
Step 2: calibrate "a = a*s[i] + c"
CODE 3 sets "a" as register A_reg, "s[i]" as shared memory b[i], and
"c" as register c_reg.
CODE 3: kernel of "a = a*s[i] + c"
[code]
static __global__ void MAD_latency(float * data, timing_stats * timings)
{
    __shared__ float b[BLOCKSIZE];
    int threadNum = threadIdx.x;
    unsigned int start_time = 1664;
    unsigned int end_time = 1664;
    // stage the first 16 elements of data[] into shared memory
    for( int j = threadNum ; j < 16 ; j+=NUM_THREADS){
        b[j] = data[j] ;
    }
    __syncthreads();
    float A_reg = data[0] ;
    float c_reg = data[2] ;
    __syncthreads();
    start_time = clock();
    #pragma unroll
    for( int j = 0 ; j < 16 ; j++){
        #pragma unroll
        for( int i = 0 ; i < 16 ; i++){
            A_reg = A_reg * b[i] + c_reg ;
        }
    }
    end_time = clock();
    __syncthreads();
    timings[threadNum].start_time = start_time;
    timings[threadNum].end_time = end_time;
    // write A_reg back so the compiler cannot optimize the loops away
    if ( 0 == threadNum ){
        timings[NUM_THREADS].start_time = (int)A_reg ;
    }
}
[/code]
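Note that the two nested loops execute 16 x 16 = 256 multiply-adds per thread, the same instruction count as in Step 1, so the per-instruction averages are directly comparable; only 16 distinct shared-memory addresses are read, presumably to keep the fully unrolled inner loop small.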
and "DECUDA 2" shows decuda's result of
[code]
start_time = clock();
#pragma unroll
for( int j = 0 ; j < 16 ; j++){
    #pragma unroll
    for( int i = 0 ; i < 16 ; i++){
        A_reg = A_reg * b[i] + c_reg ;
    }
}
end_time = clock();
__syncthreads();
[/code]
DECUDA 2: "a = a*s[i] + c" --> "mad $r3 s[...] $r3 $r2"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\decuda2.jpg[/img]
The experimental result (see Table 5) shows:
Latency of "a = a*s[i] + c" = 34.6 cycles.
Throughput of "a = a*s[i] + c" is about 6 cycles per warp; this number matches
Table 2.
Table 5: result of "a = a*s[i] + c"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\table5.jpg[/img]
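By the same Gantt-chart reasoning as in figure 1, hiding the 34.6-cycle latency at 6 cycles of issue per warp should take about ceil(34.6 / 6) = 6 warps.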
Step 3: calibrate "a = a*b + c"
CODE 4 sets "a", "b", "c" as registers A_reg, b_reg, and c_reg.
CODE 4: kernel of "a = a*b + c"
[code]
static __global__ void MAD_latency(float * data, timing_stats * timings)
{
    int threadNum = threadIdx.x;
    unsigned int start_time = 1664;
    unsigned int end_time = 1664;
    float A_reg = data[0] ;
    float b_reg = data[1] ;
    float c_reg = data[2] ;
    __syncthreads();
    start_time = clock();
    #pragma unroll
    for( int j = 0 ; j < BLOCKSIZE * MAXITE ; j++){
        A_reg = A_reg * b_reg + c_reg ;
    }
    end_time = clock();
    __syncthreads();
    timings[threadNum].start_time = start_time;
    timings[threadNum].end_time = end_time;
    // write A_reg back so the compiler cannot optimize the loop away
    if ( 0 == threadNum ){
        timings[NUM_THREADS].start_time = (int)A_reg ;
    }
}
[/code]
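Note that the timed loop now executes BLOCKSIZE * MAXITE iterations rather than 256, so the per-instruction average must be divided by that count; a minimal sketch (my reconstruction; the values of BLOCKSIZE and MAXITE are not given in this post):
[code]
// average cycles per "a = a*b + c"; the timed loop runs
// BLOCKSIZE * MAXITE iterations (macro values not shown here)
float avg_min = (float)min_time / (float)(BLOCKSIZE * MAXITE) ;
[/code]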
and "DECUDA 3" shows decuda's result of
[code]
start_time = clock();
#pragma unroll
for( int j = 0 ; j < BLOCKSIZE * MAXITE ; j++){
    A_reg = A_reg * b_reg + c_reg ;
}
end_time = clock();
__syncthreads();
[/code]
DECUDA 3: "a = a*b + c" --> "mad $r4 $r4 $r3 $r2"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\decuda3.jpg[/img]
The experimental result (see Table 6) shows latency = 31.5 cycles;
however, the minimum time and maximum time differ greatly when NUM_THREADS > 256,
and I don't know why.
If we focus on NUM_THREADS = 192, 224, 256, then the throughput is 4 cycles per warp,
and we need 8 warps to hide the pipeline latency (= 31.5 cycles).
Table 6: result of "a = a*b + c"
[img]H:\course\2008summer\c_lang\NVIDIA\MAD_pic\table6.jpg[/img]
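As a check on the 8-warp claim: with a 31.5-cycle latency and one warp issuing every 4 cycles, ceil(31.5 / 4) = 8 warps are needed before the first warp's result becomes available, consistent with the reasoning of figure 1.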