pipeline latency
In paper [1], the authors measure pipeline latency on several graphics cards and report the results in Table 1.
table 1, pipeline_latency_table.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_table.JPG[/img]
The table shows that a register-to-register MAD (multiply-and-add) instruction has a latency of 24 cycles.
The authors argue that the "24 cycle latency may be hidden by running simultaneously 6 warps (or 192 threads) per SM".
This matches the description in section 5.1.2.6 of the programming guide:
"Generally, accessing a register is zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts.
The delays introduced by read-after-write dependencies can be ignored as soon as there are at least 192 active threads per multiprocessor to hide them"
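As a rough illustration of how this latency can be measured and hidden (my own sketch, not the benchmark code from [1]; the kernel name and parameters are made up), one can time a chain of dependent MADs and vary the number of resident warps:
[code]
// Sketch: each thread runs a chain of dependent MADs, so every MAD must
// wait for the previous result (register read-after-write dependence).
// Timing this with <<<1,32>>> (one warp) should expose the ~24-cycle
// latency per MAD, while <<<1,192>>> (six warps) should largely hide it.
__global__ void mad_chain(float *out, float b, float c, int n)
{
    float a = threadIdx.x;      // per-thread starting value
    for (int i = 0; i < n; ++i)
        a = a * b + c;          // dependent MAD
    out[threadIdx.x] = a;       // keep the chain from being optimized away
}
[/code]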
Question: how does the scheduler dispatch warps in an SM? Two possible methods:
Method 1 : a warp occupies the SPs until a memory-access instruction is executed.
Method 2 : each warp executes one instruction in turn.
In section 4.1 of the programming guide, it says "Every instruction issue time, the SIMT unit selects a warp that is ready to execute and issues the next instruction to the active threads of the warp".
It seems that the hardware supports method 2.
I use an example to illustrate method 1 and method 2.
Example: execute three instructions S1, S2 and S3 in turn.
[code]
S1 : a <-- a * b + c ;    // register read-after-write dependence
S2 : a <-- a * b + c ;    // register read-after-write dependence
S3 : odata[index] <-- a ; // write operation
[/code]
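For concreteness, a minimal CUDA kernel corresponding to this pseudocode might look as follows (my own sketch; odata, b, c and index are taken from the example above):
[code]
__global__ void s1s2s3(float *odata, float b, float c)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float a = (float)index;
    a = a * b + c;       // S1 : register read-after-write dependence on a
    a = a * b + c;       // S2 : depends on the result of S1
    odata[index] = a;    // S3 : write to global memory
}
[/code]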
We show the Gantt chart of method 1 in figure 1 and the Gantt chart of method 2 in figure 2.
figure 1, pipeline_latency_1.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_1.JPG[/img]
figure 2, pipeline_latency_2.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_2.JPG[/img]
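A rough way to see why 6 warps are enough under method 2 (my own accounting, assuming the SM issues one warp instruction every 4 cycles on its 8 SPs): warp 0 issues S1 at cycle 0, warps 1 to 5 issue their S1 at cycles 4, 8, ..., 20, so warp 0 gets its next issue slot at cycle
[code]
t_{\text{S2 issue, warp 0}} \;=\; 6 \ \text{warps} \times 4 \ \text{cycles/warp} \;=\; 24 \ \text{cycles},
[/code]
which is exactly the MAD latency, so its S2 can be issued without stalling.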
Reference: [1] Vasily Volkov and James W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra", SC '08.
=======================================================================================
Although method 2 is correct, it is tedious to draw a Gantt chart with it.
I did not adopt method 2 to draw the Gantt chart when comparing bandwidth between "float" and "double" in the thread
http://forums.nvidia.com/index.php?showtopic=106924&pid=600634&mode=threaded&start=#entry600634.
In that topic, I did not use pipeline latency but calibrated "index computation" via the block-wise test harness provided by SPWorley.
In the thread http://forums.nvidia.com/index.php?showtopic=103046 , @SPWorley uses one block of 192 threads to calibrate "how many clocks the operation takes".
The following code is the measurement macro that launches the calibration kernel:
[code]
#define ITEST(num) \
    ikernel<itest_##num, EVALLOOPS> <<<1,192>>>(12345, d_result); \
    cudaMemcpy( &h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost); \
    printf("I" #num " %4.1lf %s\n", \
           8*(h_result-ibase+1.0)/(192*EVALLOOPS*UNROLLCOUNT), \
           itest_##num ().describe());
[/code]
@SPWorley's code uses
(1) 192 threads to hide pipeline latency, and
(2) a large unrolled loop;
a simplified sketch of the same idea is shown below.
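The following is my own simplified sketch of this kind of calibration, not SPWorley's actual harness (ikernel, itest_##num, EVALLOOPS and UNROLLCOUNT belong to his code and are not reproduced here). One block of 192 threads times an unrolled chain of dependent MADs with clock():
[code]
// Simplified calibration sketch (not SPWorley's harness): one block of
// 192 threads measures an unrolled chain of dependent MADs with clock().
__global__ void calibrate(float *out, unsigned int *clocks, float b, float c)
{
    float a = threadIdx.x;
    unsigned int t0 = (unsigned int)clock();   // SM clock counter before the loop
    #pragma unroll 16
    for (int i = 0; i < 4096; ++i)
        a = a * b + c;                         // dependent MAD, 4096 per thread
    unsigned int t1 = (unsigned int)clock();   // SM clock counter after the loop
    out[threadIdx.x] = a;                      // keep the chain from being optimized away
    if (threadIdx.x == 0)
        *clocks = t1 - t0;                     // elapsed clocks on this SM
}
[/code]
Launching calibrate<<<1,192>>>(...) and dividing the elapsed clocks by the number of warp instructions issued (6 warps x 4096 MADs) should give roughly the 4-cycle issue cost per MAD, since the 192 threads hide the pipeline latency.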
Question: what is the relationship between pipeline latency and SPWorley's use of one block of 192 threads to calibrate "how many clocks it takes"?
First, suppose we want to evaluate operation S1 ( a <-- a * b + c ); then we must execute S1 many times, say M times.
[code]
for i = 1: M
S1 : a <-- a * b + c ; // register read-after-write dependence
end
S2 : a <-- a * b + c ; // register read-after-write dependence
[/code]
It is then easy to plot the Gantt chart of the above code: just modify the Gantt chart in figure 2 so that operation S1 is repeated M times; see figure 3.
figure 3, pipeline_latency_3.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_3.JPG[/img]
If M is large enough, then the average execution time of S1 on one SM is about 1 cycle (when all 8 SPs execute S1 simultaneously, only 1 cycle is needed to complete S1 for 8 threads).
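A rough accounting of this claim (my own, assuming 6 resident warps, one warp instruction issued every 4 cycles on 8 SPs, and a 24-cycle MAD latency):
[code]
T_{\text{total}} \;\approx\; 24 + 4 \cdot 6 \cdot M \ \text{cycles},
\qquad
\frac{T_{\text{total}}}{6M} \;\approx\; 4 + \frac{4}{M} \;\longrightarrow\; 4 \ \text{cycles per warp as } M \to \infty,
[/code]
i.e. about 1 cycle per group of 8 threads (one thread per SP), which matches the statement that the average execution time of S1 on one SM is about 1 cycle.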
Similarly, if we want to calibrate the operation "S1 + S2" in the following code,
[code]
for i = 1: M
S1 : a <-- a * b + c ; // register read-after-write dependence
S2 : a <-- a * b + c ; // register read-after-write dependence
end
S3 : odata[index] <-- a ; // write operation
[/code]
then the Gantt chart is figure 4; the average execution time of S1 + S2 on one SM is about 2 cycles.
figure 4, pipeline_latency_4.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_4.JPG[/img]
The average execution time of S1 on one SM is 1 cycle; this means that one warp needs 4 cycles to execute the S1 instruction in an average sense. We define method 3 as method 1, but with the average execution time of each instruction.
Then the Gantt chart of the following code is figure 5.
[code]
// Example : execute four instructions S1, S2, S3 and S4 in turn
S1 : a <-- a * b + c ;    // register read-after-write dependence
S2 : a <-- a * b + c ;    // register read-after-write dependence
S3 : odata[index] <-- a ; // write operation
S4 : a <-- a * b + c ;
[/code]
figure 5, pipeline_latency_5.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_5.JPG[/img]
However, if we use method 2, then the Gantt chart is figure 6.
figure 6, pipeline_latency_6.jpg
[img]http://oz.nthu.edu.tw/~d947207/NVIDIA/pipeline_latency/pipeline_latency_6.JPG[/img]
Observation: the differences between method 2 and method 3 are:
(1) Method 3 hides the "index computation" in the memory latency, while method 2 hides it in the pipeline latency.
(2) The space between two read/write operations (red rectangles) is larger in method 3.
However, the critical timing of method 2 and method 3 is the same, so we can use method 3 to plot the Gantt chart.
To sum up, I think it is reasonable to draw the Gantt chart by method 3, which is simpler.