flat assembler
Message board for the users of flat assembler.
Counting cycles with RDTSC
rugxulo 15 Mar 2009, 01:54
buzzkill wrote: "using cpu performance counters requires ring0 privileges on linux, i believe."
Can anyone confirm this? I've heard this several times, and I thought it was toggleable (although why you'd want to disable it or make it ring3-only is beyond me). In other words, shouldn't it work in ring 3 as well?
bitRAKE 15 Mar 2009, 02:18
The overhead calculation in dap's code isn't aligned, nor does it preload the stack into L1 cache. I might be wrong here, but I bet the 9 cycle difference results from loading a cache line from L2 into L1 - code or data. RDTSC can be accurate despite the lack of a guarantee to that effect. Small timings are nice because the likelihood of being interrupted is reduced, but they often incorrectly assume ideal conditions.
buzzkill 15 Mar 2009, 02:24
rugxulo, your post motivated me to do some research. this is what i have found so far:
- i enabled CONFIG_X86_MSR in my kernel .config (and recompiled / rebooted).
- i found the device:
Code:
$ ls -l /dev/cpu/0/msr
crw-rw---- 1 root root 202, 0 Mar 14 18:57 /dev/cpu/0/msr
- but i can't access it as user:
Code:
$ cat /dev/cpu/0/msr
cat: /dev/cpu/0/msr: Permission denied
- i can access it as root, but it spilled a lot of gibberish on my terminal
- i found this page and downloaded msr-tools-1.2.tar.bz2, and compiled it.
- again, i can't run it as user:
Code:
$ ./rdmsr --hexadecimal 0x10
rdmsr: open: Permission denied
- this time, when i run it as root, i get sensible output:
Code:
$ sudo ./rdmsr --hexadecimal 0x10
4a168273d8fa
btw, msr register 0x10 is supposed to be the time stamp counter, according to this page. so since i can access the msr's from userspace, but not as user, it appears they are accessible from ring3, but only for privileged users... if anyone can shed more light on this, please chime in... (although maybe we should take this to the linux part of the forum?)
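For readers who want to reproduce the rdmsr run above without msr-tools, here is a minimal C sketch of reading an MSR through the same device node. The function name `read_msr` is hypothetical. The sketch relies on the msr device being addressed by file offset (offset N selects MSR number N, with 8 bytes per read). Note that the RDMSR instruction itself is privileged; the driver executes it in the kernel on the caller's behalf, which is why file permissions decide who may read.

```c
#define _XOPEN_SOURCE 700   /* for pread() under strict -std=c11 */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Read one 64-bit MSR through the Linux msr character device.
 * dev_path is a node such as /dev/cpu/0/msr; the file offset selects
 * the MSR number. Returns 0 on success, -1 if the device cannot be
 * opened or read (for example when run without sufficient privileges). */
int read_msr(const char *dev_path, uint32_t msr, uint64_t *value)
{
    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, value, sizeof *value, (off_t)msr);
    close(fd);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

Calling `read_msr("/dev/cpu/0/msr", 0x10, &tsc)` as root should correspond to the `rdmsr --hexadecimal 0x10` run above; as an ordinary user it is expected to return -1, matching the "Permission denied" result.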
buzzkill 15 Mar 2009, 02:32
bitRAKE wrote: "The overhead calculation in dap's code isn't aligned nor does it preload the stack into L1 cache. I might be wrong here, but I bet the 9 cycle difference results from loading a cache line from L2 into L1 - code or data. RDTSC can be accurate despite the lack of a guarantee to that effect. Small timings are nice because the likelihood of being interrupted is reduced, but often incorrectly assume ideal conditions."
shouldn't that (loading a cache line from L2 into L1) only cause the overhead of the first run to be higher? i do several sets of 10 or 50 runs consecutively, on an otherwise idle system, so i'd expect that everything needed is in L1 after the first run?
anyway, if you have a patch for dap's program that fixes this, i'd like to see it, esp. how you go about aligning the calculation code since i don't think i've ever done that in C before.
bitRAKE 15 Mar 2009, 23:37
Yeah, the first run should bring everything into L1. Only if the routine being tested used more code/data than the size of L1 would it be a problem. Or, if the timing was interrupted.
AFAIK, alignment in C is compiler specific. I revisited my timing code for 64-bit and there is only a variation of $1000 parts in $10C8EF0 - which is very good compared to your results above. Changes to the code quickly produced similar 1/3 results - with the variation +/- 7 cycles. IIRC, it was manually tweaked until it was consistent, but I don't know why it's more consistent. Even a simple test like:
Code:
align 64
RDTSC_Test:
        mov [__RSP__],rsp       ; 48 89 25 22 EF BF FF
        and rsp,-64             ; 48 83 E4 C0
        push 0                  ; 6A 00
        pop rax                 ; 58
        cpuid                   ; 0F A2
        ; already aligned to 16
        rept 256 {              ; eight bytes each
                rdtsc           ; 0F 31
                push rax        ; 50
                push 0          ; 6A 00
                pop rax         ; 58
                cpuid           ; 0F A2
        }
        pop rax
        neg eax
        mov ecx,256-1
@@:     pop rdx
        xadd edx,eax
        neg edx
        neg eax
        mov [rdi+rcx*4],edx
        loop @B
        mov rsp,[__RSP__]       ; 48 8B 25 F4 EE BF FF
        retn                    ; C3

align 64
RDTSC_Buffer rd 256

align 64
Whatever:
        lea edi,[RDTSC_Buffer]
        call RDTSC_Test
        call RDTSC_Test
        int3
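On the question of aligning things in C: for data, C11 offers a portable `alignas`, so only code alignment is left to the compiler (GCC, for instance, accepts an `aligned` attribute on functions). The sketch below is an assumed C counterpart of `RDTSC_Buffer`, not a port of the code above; the names `timing_buffer` and `is_cacheline_aligned` are hypothetical, and the 64-byte line size is taken from the `align 64` directives in the post.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical counterpart of RDTSC_Buffer: 256 dwords placed on a
 * 64-byte boundary, so the buffer starts at the beginning of a cache
 * line (assuming 64-byte lines). */
alignas(64) uint32_t timing_buffer[256];

/* Returns 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_cacheline_aligned(const void *p)
{
    return ((uintptr_t)p & 63u) == 0;
}
```

A quick check such as `is_cacheline_aligned(timing_buffer)` at startup is a cheap way to confirm that the toolchain honoured the request.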
buzzkill 16 Mar 2009, 00:31
Still, your results (35/255) are better, even though you don't serialize more than once before the cpuid (I think I read somewhere that 3 times is recommended). I tried to modify dap's run_tests.asm a little, but couldn't get rid of the 9 cycle difference (I even managed a bigger difference once). Anyway, we know that these timings (with rdtsc) are not perfect, so these results are good enough for me.
(OT: why do you do "and rsp,-64" in your code? It seems like an unusual way to modify the stack pointer?)
bitRAKE 16 Mar 2009, 01:09
AND RSP,-64 aligns the stack down to a cache line boundary - the result is never greater than the current stack pointer (remember the stack moves down in addresses). It just clears the low six bits. AND ESP,-64 clears the upper dword as well - which in most cases is okay. We also have AND SPL,-64.
The trick is to use XCHG with memory and consistency goes through the roof.
Code:
align 64                        ; time routine RBP
RDTSC_Time:
        mov [__RSP__],rsp       ; 48 89 25 .. .. .. ..
        and esp,-64             ; 83 E4 C0
        xor eax,eax             ; 31 C0
        push rax                ; 50
        push rax                ; 50
        cpuid                   ; 0F A2
        rdtsc                   ; 0F 31
;---------------------------: TIME START :
        xchg [rsp],rax          ; 48 87 04 24 better than 32-bit xchg
        cpuid                   ; 0F A2
        call rbp                ; FF D5
        xor eax,eax             ; 31 C0
        cpuid                   ; 0F A2
        rdtsc                   ; 0F 31
;---------------------------: TIME END :
        xchg [rsp+8],rax        ; 48 87 44 24 08
        cpuid                   ; 0F A2
        nop                     ; 90 needed
        pop rcx                 ; 59
        pop rax                 ; 58
        sub eax,ecx             ; 29 C8
        mov rsp,[__RSP__]       ; 48 8B 25 .. .. .. ..
        retn                    ; C3
(notice the grouping into eight byte pieces)
Last edited by bitRAKE on 16 Mar 2009, 02:54; edited 1 time in total
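The effect of AND with -64 on the stack pointer can be checked with plain integer arithmetic. The following C sketch models it on an ordinary 64-bit value; the function names are hypothetical, and the second function only imitates the AND ESP,-64 case, where a 32-bit register write zeroes the upper half of the 64-bit register.

```c
#include <stdint.h>

/* Round addr down to a multiple of 64 by clearing the low six bits,
 * which is what AND RSP,-64 does to the stack pointer. */
uint64_t align_down_64(uint64_t addr)
{
    return addr & ~UINT64_C(63);    /* ~63 equals -64 in two's complement */
}

/* Model of AND ESP,-64 in 64-bit mode: the low 32 bits are aligned
 * and the upper 32 bits of the result are zero. */
uint64_t align_down_64_esp(uint64_t addr)
{
    return (uint64_t)((uint32_t)addr & ~UINT32_C(63));
}
```

For example, `align_down_64(0x7FFF12345678)` gives `0x7FFF12345640`, while `align_down_64_esp` on the same input gives `0x12345640`, which is only a usable stack address if the stack already lives below 4 GiB.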
revolution 16 Mar 2009, 01:09
I once used these two macros to get within ±1 cycle count on my PentiumM. It returned consistent results for all the tests I was trying at the time.
Code:
macro prolog {
        xor eax,eax
        cpuid
        stmxcsr [dummy_dat]
        pxor xmm0,xmm0
        pxor xmm1,xmm1
        pxor xmm2,xmm2
        pxor xmm3,xmm3
        pxor xmm4,xmm4
        pxor xmm5,xmm5
        pxor xmm6,xmm6
        pxor xmm7,xmm7
;       ldmxcsr [dummy_dat]     ;not needed, still get good results without it
        xor eax,eax
        cpuid
        rdtsc
        mov [tsc_lo],eax
        mov [tsc_hi],edx
        xor eax,eax
        cpuid
}
macro epilog {
        ldmxcsr [dummy_dat]
        xor eax,eax
        cpuid
        rdtsc
        sub eax,[tsc_lo]
        sbb edx,[tsc_hi]
}
Code:
align 16
time_nothing:
        prolog
        ;code is empty, nothing in between
        epilog
        ret
Code:
align 16
time_nop:
        prolog
        nop
        epilog
        ret
All tests were run after momentarily putting the thread into realtime priority during the run and always discarding timings for the first two trials. An interesting result was this 4 nop test that can run in one cycle. The nop is nearly completely optimised out of the instruction stream:
Code:
align 16
time_nop_nop_nop_nop:
        prolog
        nop
        nop
        nop
        nop
        epilog
        ret
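The post describes discarding the first two trials and comparing against an empty prolog/epilog pair (time_nothing). How the remaining trials are reduced to one number is not stated, so the C sketch below makes two assumptions: it keeps the minimum of the remaining samples, and it subtracts the empty-harness count as overhead. All names here are hypothetical, and the samples are expected to be cycle counts already collected by a harness such as the macros above.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest sample after skipping the first `discard` warm-up trials.
 * Returns UINT64_MAX if no samples remain after discarding. */
uint64_t min_after_warmup(const uint64_t *samples, size_t n, size_t discard)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = discard; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Net cost of the code under test: the measured count minus the count
 * of the empty harness, clamped at zero so measurement jitter cannot
 * produce a wrapped-around unsigned result. */
uint64_t net_cycles(uint64_t measured, uint64_t overhead)
{
    return measured > overhead ? measured - overhead : 0;
}
```

With this arrangement, the result for time_nop would be `net_cycles(min_after_warmup(nop_samples, n, 2), min_after_warmup(empty_samples, n, 2))`, where both sample arrays are placeholders for data gathered on the machine under test.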
bitRAKE 16 Mar 2009, 02:25
I'm seeing consistency of 1/389 using RDTSC/XCHG/CPUID.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.