latency of shared memory
From Volkov's paper,
Vasily Volkov and James W. Demmel, "Benchmarking GPUs to Tune Dense Linear Algebra."
In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing,
Piscataway, NJ, USA, 2008, IEEE Press.
( it can be obtained in the thread http://forums.nvidia.com/index.php?showtopic=89084 )
the authors show that the shared memory latency of the 8800 GTX is 36 cycles, see Figure 1.
Figure 1: smem_latency_2.jpg
I tried to find the shared memory latency of the Tesla C1060, and my experimental
result shows that it is 34 cycles.
The procedure of my experiment is as follows:
I modified the code in cuda_latency.tar.gz provided by @Sylvain Collange in the thread
http://forums.nvidia.com/index.php?showtopic=80451&pid=468968&mode=threaded&start=#entry468968
The kernel is
[code]
#define SMEM_SIZE 256

static __global__ void smem_latency(int *data, timing_stats *timings)
{
    __shared__ float b[SMEM_SIZE];
    int threadNum = threadIdx.x;
    volatile unsigned int start_time = 1664;
    volatile unsigned int end_time = 1664;

    // Fill shared memory from global memory (not part of the timed section).
    #pragma unroll 1
    for (int i = 0; i < SMEM_SIZE; ++i) {
        b[i] = data[i];
    }
    __syncthreads();

    // Timed section: a dependent chain of shared memory loads (pointer
    // chasing), so each iteration must wait for the previous load.
    // The outer loop runs twice; only the second (warm) measurement survives.
    int k = 0;
    for (int j = 0; j < 2; j++) {
        start_time = clock();
        #pragma unroll
        for (int i = 0; i < SMEM_SIZE; ++i) {
            k = b[k];
        }
        end_time = clock();
    }
    __syncthreads();

    timings[threadNum].start_time = start_time;
    timings[threadNum].end_time = end_time;

    // Consume k so the compiler cannot remove the timed loop.
    if (0 == threadNum) {
        timings[1].start_time = k;
    }
}
[/code]
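The kernel refers to a timing_stats struct that comes from cuda_latency.tar.gz and is not shown above. Here is a minimal sketch of what the kernel assumes; the field names are taken from the kernel itself, and the exact definition in the package may differ.
[code]
// Per-thread timing record as used by the kernel above (sketch only;
// the real definition is in cuda_latency.tar.gz).
struct timing_stats {
    unsigned int start_time;
    unsigned int end_time;
};
[/code]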
The execution configuration is one block of one thread:
[code]
dim3 grid(1, 1, 1);
dim3 threads(1, 1, 1);
smem_latency<<< grid, threads >>>(data_gpu, timings_gpu) ;
[/code]
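The original package handles allocation and readback; since that part is not shown here, below is a minimal host-side sketch of how I drive the kernel and read off the per-iteration cost. The zero-filled data array keeps the pointer chase k = b[k] at index 0, and dividing the measured cycle count by SMEM_SIZE gives the cost of a single k = b[k] iteration. The names data_gpu and timings_gpu match the launch above; everything else (host_data, host_timings, main) is illustrative only.
[code]
#include <cstdio>

int host_data[SMEM_SIZE] = {0};   // all zeros: b[k] always returns 0,
                                  // so the dependent chain stays at index 0
timing_stats host_timings[2];     // element 0 is written by thread 0

int main()
{
    int *data_gpu;
    timing_stats *timings_gpu;
    cudaMalloc((void**)&data_gpu, sizeof(host_data));
    cudaMalloc((void**)&timings_gpu, sizeof(host_timings));
    cudaMemcpy(data_gpu, host_data, sizeof(host_data), cudaMemcpyHostToDevice);

    dim3 grid(1, 1, 1);
    dim3 threads(1, 1, 1);
    smem_latency<<< grid, threads >>>(data_gpu, timings_gpu);

    cudaMemcpy(host_timings, timings_gpu, sizeof(host_timings),
               cudaMemcpyDeviceToHost);

    // Total cycles of the timed loop divided by the number of k = b[k]
    // iterations gives the cost of one iteration (the figure quoted below).
    unsigned int total = host_timings[0].end_time - host_timings[0].start_time;
    printf("cycles per k = b[k] iteration: %.1f\n", (float)total / SMEM_SIZE);

    cudaFree(data_gpu);
    cudaFree(timings_gpu);
    return 0;
}
[/code]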
The result is 58 cycles per iteration. ( this is larger than 36 cycles )
However, if we use decuda to disassemble the .cubin file, then
[code]
    int k = 0;
    for (int j = 0; j < 2; j++) {
        start_time = clock();
        #pragma unroll
        for (int i = 0; i < SMEM_SIZE; ++i) {
            k = b[k];
        }
        end_time = clock();
    }
    __syncthreads();
[/code]
would be decoded as
[code]
label1: mov.b32 $r2, %clock
shl.u32 $r2, $r2, 0x00000001
movsh.b32 $ofs1, $r3, 0x00000002
cvt.rzi.s32.f32 $r3, s[$ofs1+0x0020]
movsh.b32 $ofs1, $r3, 0x00000002
cvt.rzi.s32.f32 $r3, s[$ofs1+0x0020]
movsh.b32 $ofs1, $r3, 0x00000002
cvt.rzi.s32.f32 $r3, s[$ofs1+0x0020]
...
movsh.b32 $ofs1, $r3, 0x00000002
cvt.rzi.s32.f32 $r3, s[$ofs1+0x0020]
mov.b32 $r4, %clock
shl.u32 $r4, $r4, 0x00000001
add.b32 $r1, $r1, 0x00000001
set.ne.s32 $p0|$o127, $r1, c1[0x0004]
@$p0.ne bra.label label1
bar.sync.u32 0x00000000
[/code]
We can compare the source code with the decuda output and find the correspondence
between them:
[code]
r1 <-- 0                       is variable "j"
r3 <-- 0                       is variable "k"
r2 = start_time <-- %clock x 2
$ofs1 <-- k * sizeof(int)
r3    <-- b[k]
$ofs1 <-- k * sizeof(int)
r3    <-- b[k]
...       ...
r4 = end_time   <-- %clock x 2
...
[/code]
This means that "k = b[k]" compiles into two instructions:
[code]
S1 : $ofs1 <-- k * sizeof(int)
S2 : r3 <-- b[k]
[/code]
and its Gantt chart is shown in Figure 2.
Figure 2: smem_latency.jpg
However, the latency of shared memory should be only the execution time of instruction S2.
Since S2 depends on S1 and S1 depends on the previous S2, each iteration costs
latency(S1) + latency(S2) = 58 cycles. We already know that the pipeline latency of a MAD
is 24 cycles, and S1 should behave the same, so the execution time of S2 is
58 - 24 = 34 cycles, which is the latency of shared memory.
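Spelled out with the numbers above (a simple check, not new measurements):
[code]
// Each k = b[k] iteration costs latency(S1) + latency(S2) = 58 cycles.
// S1 is assumed to have the same 24-cycle pipeline latency as a MAD,
// so the shared memory load S2 accounts for the rest.
const int cycles_per_iteration = 58;  // measured above
const int pipeline_latency     = 24;  // known MAD-class latency
const int smem_latency_cycles  = cycles_per_iteration - pipeline_latency;  // = 34
[/code]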