pipeline latency
In paper [1], the authors measure pipeline latency on several graphics cards and report the results in Table 1.
table 1, pipeline_latency_table.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_table.JPG[/img]
The table shows that a register-to-register MAD (multiply-and-add) instruction has a latency of 24 cycles.
The authors argue that the "24 cycle latency may be hidden by running simultaneously 6 warps (or 192 threads) per SM".
This matches the description in section 5.1.2.6 of the programming guide:
"Generally, accessing a register is zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts.
The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them"
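As a rough illustration of how this latency can be measured and hidden (my own sketch, not the benchmark code from [1]; the kernel name and parameters are made up), one can time a chain of dependent MADs and vary the number of resident warps:
[code]
// Sketch: each thread runs a chain of dependent MADs, so every MAD must
// wait for the previous result (register read-after-write dependence).
// Timing this with <<<1,32>>> (one warp) should expose the ~24-cycle
// latency per MAD, while <<<1,192>>> (six warps) should largely hide it.
__global__ void mad_chain(float *out, float b, float c, int n)
{
    float a = threadIdx.x;      // per-thread starting value
    for (int i = 0; i < n; ++i)
        a = a * b + c;          // dependent MAD
    out[threadIdx.x] = a;       // keep the chain from being optimized away
}
[/code]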
Question: how does the scheduler dispatch warps in an SM? Two possible methods:
Method 1 : a warp occupies the SPs until a memory-access instruction is executed.
Method 2 : each warp executes one instruction in turn.
In section 4.1 of the programming guide, it says "Every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp".
It seems that the hardware supports method 2.
I use an example to illustrate method 1 and method 2.
Example: execute three instructions S1, S2 and S3 in turn.
[code]
S1 : a <-- a * b + c ;    // register read-after-write dependence
S2 : a <-- a * b + c ;    // register read-after-write dependence
S3 : odata[index] <-- a ; // write operation
[/code]
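For concreteness, a minimal CUDA kernel corresponding to this pseudocode might look as follows (my own sketch; odata, b, c and index are taken from the example above):
[code]
__global__ void s1s2s3(float *odata, float b, float c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = (float)index;
    a = a * b + c;       // S1 : register read-after-write dependence on a
    a = a * b + c;       // S2 : depends on the result of S1
    odata[index] = a;    // S3 : write to global memory
}
[/code]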
We show the Gantt chart of method 1 in figure 1 and the Gantt chart of method 2 in figure 2.
figure 1, pipeline_latency_1.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_1.JPG[/img]
figure 2, pipeline_latency_2.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_2.JPG[/img]
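A rough way to see why 6 warps are enough under method 2 (my own accounting, assuming the SM issues one warp instruction every 4 cycles on its 8 SPs): warp 0 issues S1 at cycle 0, warps 1 to 5 issue their S1 at cycles 4, 8, ..., 20, so warp 0 gets its next issue slot at cycle
[code]
t_{\text{S2 issue, warp 0}} \;=\; 6 \ \text{warps} \times 4 \ \text{cycles/warp} \;=\; 24 \ \text{cycles},
[/code]
which is exactly the MAD latency, so its S2 can be issued without stalling.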
Reference: [1] Vasily Volkov and James W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra", SC '08.
=======================================================================================
Although method 2 is correct, it is tedious to draw a Gantt chart with it.
I did not adopt method 2 to draw the Gantt chart when comparing bandwidth between "float" and "double" in the thread
http://forums.nvidia.com/index.php?showtopic=106924&pid=600634&mode=threaded&start=#entry600634.
In that topic, I did not use pipeline latency but calibrated "index computation" via the block-wise test harness provided by SPWorley.
In the thread http://forums.nvidia.com/index.php?showtopic=103046 , @SPWorley uses one block of 192 threads to calibrate "how many clocks the operation takes".
The following code is the measurement macro that launches the calibration kernel:
[code]
#define ITEST(num) \
    ikernel<itest_##num, EVALLOOPS> <<<1,192>>>(12345, d_result); \
    cudaMemcpy( &h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost); \
    printf("I" #num " %4.1lf %s\n", \
           8*(h_result-ibase+1.0)/(192*EVALLOOPS*UNROLLCOUNT), \
           itest_##num ().describe());
[/code]
@SPWorley's code uses
(1) 192 threads to hide pipeline latency, and
(2) a large unrolled loop;
a simplified sketch of the same idea is shown below.
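The following is my own simplified sketch of this kind of calibration, not SPWorley's actual harness (ikernel, itest_##num, EVALLOOPS and UNROLLCOUNT belong to his code and are not reproduced here). One block of 192 threads times an unrolled chain of dependent MADs with clock():
[code]
// Simplified calibration sketch (not SPWorley's harness): one block of
// 192 threads measures an unrolled chain of dependent MADs with clock().
__global__ void calibrate(float *out, unsigned int *clocks, float b, float c)
{
    float a = threadIdx.x;
    unsigned int t0 = (unsigned int)clock();   // SM clock counter before the loop
    #pragma unroll 16
    for (int i = 0; i < 4096; ++i)
        a = a * b + c;                         // dependent MAD, 4096 per thread
    unsigned int t1 = (unsigned int)clock();   // SM clock counter after the loop
    out[threadIdx.x] = a;                      // keep the chain from being optimized away
    if (threadIdx.x == 0)
        *clocks = t1 - t0;                     // elapsed clocks on this SM
}
[/code]
Launching calibrate<<<1,192>>>(...) and dividing the elapsed clocks by the number of warp instructions issued (6 warps x 4096 MADs) should give roughly the 4-cycle issue cost per MAD, since the 192 threads hide the pipeline latency.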
Question: what is the relationship between pipeline latency and SPWorley's use of one block of 192 threads to calibrate "how many clocks it takes"?
First, suppose we want to evaluate operation S1 ( a <-- a * b + c ); then we must execute S1 many times, say M times.
[code]
for i = 1: M
S1 : a <-- a * b + c ; // register read-after-write dependence
end
S2 : a <-- a * b + c ; // register read-after-write dependence
[/code]
It is then easy to plot the Gantt chart of the above code: just modify the Gantt chart in figure 2 so that operation S1 is repeated M times; see figure 3.
figure 3, pipeline_latency_3.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_3.JPG[/img]
If M is large enough, then the average execution time of S1 on one SM is about 1 cycle (when all 8 SPs execute S1 simultaneously, only 1 cycle is needed to complete S1 for 8 threads).
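A rough accounting of this claim (my own, assuming 6 resident warps, one warp instruction issued every 4 cycles on 8 SPs, and a 24-cycle MAD latency):
[code]
T_{\text{total}} \;\approx\; 24 + 4 \cdot 6 \cdot M \ \text{cycles},
\qquad
\frac{T_{\text{total}}}{6M} \;\approx\; 4 + \frac{4}{M} \;\longrightarrow\; 4 \ \text{cycles per warp as } M \to \infty,
[/code]
i.e. about 1 cycle per group of 8 threads (one thread per SP), which matches the statement that the average execution time of S1 on one SM is about 1 cycle.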
Similarly, if we want to calibrate the operation "S1 + S2" in the following code,
[code]
for i = 1: M
S1 : a <-- a * b + c ; // register read-after-write dependence
S2 : a <-- a * b + c ; // register read-after-write dependence
end
S3 : odata[index] <-- a ; // write operation
[/code]
then the Gantt chart is figure 4; the average execution time of S1 + S2 on one SM is about 2 cycles.
figure 4, pipeline_latency_4.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_4.JPG[/img]
The average execution time of S1 on one SM is 1 cycle; this means that one warp needs 4 cycles to execute the S1 instruction in an average sense. We define method 3 as method 1, but with the average execution time of each instruction.
Then the Gantt chart of the following code is figure 5.
[code]
// Example : execute four instructions S1, S2, S3 and S4 in turn
S1 : a <-- a * b + c ;    // register read-after-write dependence
S2 : a <-- a * b + c ;    // register read-after-write dependence
S3 : odata[index] <-- a ; // write operation
S4 : a <-- a * b + c ;
[/code]
figure 5, pipeline_latency_5.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_5.JPG[/img]
However, if we use method 2, then the Gantt chart is figure 6.
figure 6, pipeline_latency_6.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_6.JPG[/img]
Observation: the differences between method 2 and method 3 are:
(1) Method 3 hides the "index computation" in the memory latency, while method 2 hides it in the pipeline latency.
(2) The space between two read/write operations (red rectangles) is larger in method 3.
However, the critical timing of method 2 and method 3 is the same, so we can use method 3 to plot the Gantt chart.
To sum up, I think it is reasonable to draw the Gantt chart by method 3, which is simpler.