flat assembler
Message board for the users of flat assembler.

Index > Main > Counting cycles with RDTSC

buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill 15 Mar 2009, 01:36
There is a loop counter; I just changed it to 50 and ran another test. The ratio of smaller values is very much the same: 17 out of 50 (ratio 0.34), while in the earlier test there were 3 out of 10 (ratio 0.3), so it does seem pretty constant. (Also, there is never any other overhead value for me, always either 630 or 639, as I said above.)

Anyway, I know these numbers aren't guaranteed to be 100% correct or repeatable every time without exceptions, but this is still the best way I know of to measure the performance of a (small) piece of code. E.g., in C I've used clock_gettime() (with CLOCK_MONOTONIC), but got greater deviations between results there. And using CPU performance counters requires ring 0 privileges on Linux, I believe.

It's probably just me, but I prefer seeing the same result every time to seeing (even small) differences in results... Very Happy
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 15 Mar 2009, 01:54
buzzkill wrote:
Using CPU performance counters requires ring 0 privileges on Linux, I believe.


Can anyone confirm this? I've heard this several times, and I thought it was toggleable (although why you'd want to disable it or make it ring0-only is beyond me). In other words, shouldn't it work in ring 3 also?
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 15 Mar 2009, 02:18
The overhead calculation in dap's code isn't aligned, nor does it preload the stack into L1 cache. I might be wrong here, but I bet the 9-cycle difference results from loading a cache line from L2 into L1 - code or data. RDTSC can be accurate despite the lack of a guarantee to that effect. Small timings are nice because the likelihood of being interrupted is reduced, but they often incorrectly assume ideal conditions.
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill 15 Mar 2009, 02:24
rugxulo, your post motivated me to do some research. This is what I have found so far:
- I enabled CONFIG_X86_MSR in my kernel .config (and recompiled / rebooted).
- I found the device:
Code:
$ ls -l /dev/cpu/0/msr 
crw-rw---- 1 root root 202, 0 Mar 14 18:57 /dev/cpu/0/msr
    

- but I can't access it as a regular user:
Code:
$ cat /dev/cpu/0/msr
cat: /dev/cpu/0/msr: Permission denied
    

- I can access it as root, but it spilled a lot of gibberish on my terminal Very Happy
- I found this page and downloaded msr-tools-1.2.tar.bz2, and compiled it.
- again, I can't run it as a regular user:
Code:
$ ./rdmsr --hexadecimal 0x10
rdmsr: open: Permission denied
    

- this time, when I run it as root, I get sensible output:
Code:
$ sudo ./rdmsr --hexadecimal 0x10
4a168273d8fa
    


Btw, MSR register 0x10 is supposed to be the time stamp counter, according to this page.

So, since I can access the MSRs from userspace, but not as an unprivileged user, it appears they are accessible from ring 3, but only for privileged users...

If anyone can shed more light on this, please chime in... (although maybe we should take this to the Linux part of the forum?)
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill 15 Mar 2009, 02:32
bitRAKE wrote:
The overhead calculation in dap's code isn't aligned, nor does it preload the stack into L1 cache. I might be wrong here, but I bet the 9-cycle difference results from loading a cache line from L2 into L1 - code or data. RDTSC can be accurate despite the lack of a guarantee to that effect. Small timings are nice because the likelihood of being interrupted is reduced, but they often incorrectly assume ideal conditions.


Shouldn't that (loading a cache line from L2 into L1) only cause the overhead of the first run to be higher? I do several sets of 10 or 50 runs consecutively, on an otherwise idle system, so I'd expect everything needed to be in L1 after the first run.

Anyway, if you have a patch for dap's program that fixes this, I'd like to see it, especially how you go about aligning the calculation code, since I don't think I've ever done that in C before.
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 15 Mar 2009, 23:37
Yeah, the first run should bring everything into L1. It would only be a problem if the routine being tested used more code/data than the size of L1, or if the timing was interrupted.

AFAIK, alignment in C is compiler specific.

I revisited my timing code for 64-bit and there is only a variation of $1000 parts in $10C8EF0 - which is very good compared to your results above. Changes to the code quickly produced similar 1/3 results - with the variation +/- 7 cycles. IIRC, it was manually tweaked until it was consistent, but I don't know why it's more consistent. Embarassed

Even a simple test like:
Code:
        align 64
RDTSC_Test:
        mov [__RSP__],rsp       ; 48 89 25 22 EF BF FF
        and rsp,-64             ; 48 83 E4 C0
        push 0                  ; 6A 00
        pop rax                 ; 58
        cpuid                   ; 0F A2

; already aligned to 16

rept 256 { ; eight bytes each
        rdtsc                   ; 0F 31
        push rax                ; 50
        push 0                  ; 6A 00
        pop rax                 ; 58
        cpuid                   ; 0F A2
}
        pop rax
        neg eax
        mov ecx,256-1
@@:     pop rdx
        xadd edx,eax
        neg edx
        neg eax
        mov [rdi+rcx*4],edx
        loop @B

        mov rsp,[__RSP__]       ; 48 8B 25 F4 EE BF FF
        retn                    ; C3

align 64
RDTSC_Buffer rd 256

align 64
Whatever:
        lea edi,[RDTSC_Buffer]
        call RDTSC_Test
        call RDTSC_Test
        int3
...couldn't produce consistent results better than 35/255.
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill 16 Mar 2009, 00:31
Still, your results (35/255) are better, even though you don't serialize more than once before the cpuid (I think I read somewhere that serializing 3 times is recommended). I tried to modify dap's run_tests.asm a little, but couldn't get rid of the 9-cycle difference (I even managed a bigger difference once Smile ). Anyway, we know these timings (with rdtsc) are not perfect, so these results are good enough for me.

(OT: why do you do "and rsp,-64" in your code? It seems like an unusual way to modify the stack pointer.)
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 16 Mar 2009, 01:09
AND RSP,-64 aligns the stack down to a 64-byte cache line boundary - the result is never greater than the current stack pointer (remember the stack grows down in addresses). It just clears the low bits. AND ESP,-64 clears the upper dword as well, which in most cases is okay. We also have AND SPL,-64. Very Happy

The trick is to use XCHG with memory and consistency goes through the roof.
Code:
        align 64
; time routine RBP
RDTSC_Time:
        mov [__RSP__],rsp       ; 48 89 25 .. .. .. ..
        and esp,-64             ; 83 E4 C0
        xor eax,eax             ; 31 C0
        push rax                ; 50
        push rax                ; 50
        cpuid                   ; 0F A2

        rdtsc                   ; 0F 31 ;-------------------------: TIME START :
        xchg [rsp],rax          ; 48 87 04 24           better than 32-bit xchg
        cpuid                   ; 0F A2

        call rbp                ; FF D5
        xor eax,eax             ; 31 C0
        cpuid                   ; 0F A2
        rdtsc                   ; 0F 31 ;---------------------------: TIME END :

        xchg [rsp+8],rax        ; 48 87 44 24 08
        cpuid                   ; 0F A2
        nop                     ; 90                                     needed

        pop rcx                 ; 59
        pop rax                 ; 58
        sub eax,ecx             ; 29 C8
        mov rsp,[__RSP__]       ; 48 8B 25 .. .. .. ..
        retn                    ; C3
This might be processor specific, but I believe it should work on many.
(notice the grouping into eight-byte pieces)


Last edited by bitRAKE on 16 Mar 2009, 02:54; edited 1 time in total
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20445
Location: In your JS exploiting you and your system
revolution 16 Mar 2009, 01:09
I once used these two macros to get within ±1 cycle count on my PentiumM. It returned consistent results for all the tests I was trying at the time.
Code:
macro prolog
{
        xor     eax,eax
        cpuid
        stmxcsr [dummy_dat]
        pxor    xmm0,xmm0
        pxor    xmm1,xmm1
        pxor    xmm2,xmm2
        pxor    xmm3,xmm3
        pxor    xmm4,xmm4
        pxor    xmm5,xmm5
        pxor    xmm6,xmm6
        pxor    xmm7,xmm7
;       ldmxcsr [dummy_dat]     ;not needed, still get good results without it
        xor     eax,eax
        cpuid
        rdtsc
        mov     [tsc_lo],eax
        mov     [tsc_hi],edx
        xor     eax,eax
        cpuid
}

macro epilog
{
        ldmxcsr [dummy_dat]
        xor     eax,eax
        cpuid
        rdtsc
        sub     eax,[tsc_lo]
        sbb     edx,[tsc_hi]
}
I would run the prolog/epilog with no instructions to get the overhead value.
Code:
align 16
time_nothing:
        prolog
        ;code is empty, nothing in between
        epilog
        ret
Then I can attempt to time a single instruction.
Code:
align 16
time_nop:
        prolog
        nop
        epilog
        ret
And after subtracting the overhead it would return either -1, 0 or +1 every time.

All tests were run after momentarily putting the thread into realtime priority for the duration of the run, and timings from the first two trials were always discarded.

An interesting result was this 4-nop test, which can run in one cycle. The nops are nearly completely optimised out of the instruction stream:
Code:
align 16
time_nop_nop_nop_nop:
        prolog
        nop
        nop
        nop
        nop
        epilog
        ret
You can also try adding one more align 16 at the very end of the prolog macro. It may help give some stability when running loops between the prolog and epilog.
bitRAKE



Joined: 21 Jul 2003
Posts: 4073
Location: vpcmpistri
bitRAKE 16 Mar 2009, 02:25
I'm seeing consistency of 1/389 using RDTSC/XCHG/CPUID.


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.