flat assembler
Message board for the users of flat assembler.

Index > Main > Counting cycles with RDTSC

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
dap



Joined: 01 Dec 2007
Posts: 61
Location: Belgium
dap 04 May 2008, 14:13
Update : download latest version here.

Hello,

Here is a small program which finds how many clocks a piece of code takes. If you use a processor which can change its frequency there is no guarantee that the results will be accurate, but I think they should be quite proportional to the real numbers.
I wonder if I shouldn't discard the first count when computing the average. Confused
It will build a Windows executable but I only use two C functions so you should easily adapt it to anoter OS.

_________________
(French only) http://dap.developpez.com


Last edited by dap on 21 Dec 2008, 13:11; edited 3 times in total
Post 04 May 2008, 14:13
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4156
Location: vpcmpistri
bitRAKE 04 May 2008, 14:58
I hope you don't mind if I use that license in the future on my own productions. Very Happy
Code:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\!\[forum]\dap>codetiming
 ____________________________________________________
|   # |     Cycles |     Overhead | Without overhead |
|-----|------------|--------------|------------------|
|   1 |        353 |          537 |       4294967112 |
|   2 |        353 |          291 |               62 |
|   3 |        353 |          291 |               62 |
|   4 |        353 |          291 |               62 |
|   5 |        353 |          291 |               62 |
|   6 |        353 |          291 |               62 |
|   7 |        353 |          291 |               62 |
|   8 |        353 |          291 |               62 |
|   9 |        353 |          291 |               62 |
|  10 |        353 |          291 |               62 |
|----------------------------------------------------|
|                            Average :  429496767.00 |
|____________________________________________________|


C:\!\[forum]\dap>    
Adding loop warm-up code:
Code:
.loopTests:
; Count the number of cycles of an empty code (overhead)
        serialize
        rdtsc
        serialize
        rdtsc
        serialize
        rdtsc
...    
...resulted in:
Code:
C:\!\[forum]\dap>codetiming
 ____________________________________________________
|   # |     Cycles |     Overhead | Without overhead |
|-----|------------|--------------|------------------|
|   1 |        490 |          291 |              199 |
|   2 |        340 |          291 |               49 |
|   3 |        340 |          291 |               49 |
|   4 |        340 |          291 |               49 |
|   5 |        340 |          291 |               49 |
|   6 |        340 |          291 |               49 |
|   7 |        340 |          291 |               49 |
|   8 |        340 |          291 |               49 |
|   9 |        340 |          291 |               49 |
|  10 |        340 |          291 |               49 |
|----------------------------------------------------|
|                            Average :         64.00 |
|____________________________________________________|


C:\!\[forum]\dap>    

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup


Last edited by bitRAKE on 04 May 2008, 15:16; edited 3 times in total
Post 04 May 2008, 14:58
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20516
Location: In your JS exploiting you and your system
revolution 04 May 2008, 15:08
The RDTSC thing is very problematic. The later Intel CPU's now seem to count at a fixed rate even though the speed step thing is activated. The older CPU's will count at a different rate when the speed step is active.

I think you should consider using the lowest time. Computing the average tends to give results that include OS task switches etc. Unless, that is, you want to also include the OS overhead as part of the overall system speed.


Last edited by revolution on 04 May 2008, 15:25; edited 1 time in total
Post 04 May 2008, 15:08
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 04 May 2008, 15:23
Okay, the first is warmup so you shouldn't count it at all. The BTBs aren't warm and all caches are cold.

From the other loops you shouldn't take the average, but the 'median' which means the most probable. In ideal conditions you should take the minimum because you want to know the capabilities of the CPU, not the average. You can't optimize based on average Smile But in real-life conditions you should stick to the median because otherwise you might get -1 sometimes. The average you should round down to integer clocks. In some cases (NOP on C2D) you need to round to 1/3 clocks. 100 NOPs might take 33.356 clocks when you average them. Its actually 33.(3)

When your processor changes its clock speed the RDTSC shows the same time except between the transitions maybe...

Now there are more tiny things like REAL_TIME priority and ring0 execution...

...and finally - http://agner.org/optimize/ - some hints there
Post 04 May 2008, 15:23
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
dap



Joined: 01 Dec 2007
Posts: 61
Location: Belgium
dap 04 May 2008, 15:42
bitRAKE wrote:
I hope you don't mind if I use that license in the future on my own productions. Very Happy

It is not mine - just to be clear : http://sam.zoy.org/wtfpl/

revolution wrote:
The RDTSC thing is very problematic. The later Intel CPU's now seem to count at a fixed rate even though the speed step thing is activated. The older CPU's will count at a different rate when the speed stop is active.

I've read a topic about this : http://www.canardplus.com/forums/showthread.php?t=22719
It's in French but maybe a Google translation will give you something interesting.
The solution seems to be using the performance monitoring counters. I will do that once I will know how to write ring0 code on Windows.

revolution wrote:
I think you should consider using the lowest time. Computing the average tends to give results that include OS task switches etc. Unless, that is, you want to also include the OS overhead as part of the overall system speed.

Madis731 wrote:
From the other loops you shouldn't take the average, but the 'median' which means the most probable. In ideal conditions you should take the minimum because you want to know the capabilities of the CPU, not the average. You can't optimize based on average Smile But in real-life conditions you should stick to the median because otherwise you might get -1 sometimes.


It's a good idea, I will add that in the results.

_________________
(French only) http://dap.developpez.com


Last edited by dap on 03 Jun 2008, 18:11; edited 1 time in total
Post 04 May 2008, 15:42
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20516
Location: In your JS exploiting you and your system
revolution 04 May 2008, 15:52
dap wrote:
The solution seems to be using the performance monitoring counters. I will do that once I will know how to write ring0 code on Windows.
You don't need to do the ring0 (unless you really want to), there is an API interface, the PDH (performance data helper) API. PdhOpenQuery, PdhGetCounterInfo, etc.
Post 04 May 2008, 15:52
View user's profile Send private message Visit poster's website Reply with quote
dap



Joined: 01 Dec 2007
Posts: 61
Location: Belgium
dap 03 Jun 2008, 19:57
Here is a new version with minimums and medians. I have ran some tests and apparently the minimum method is the most efficient.

_________________
(French only) http://dap.developpez.com
Post 03 Jun 2008, 19:57
View user's profile Send private message Visit poster's website Reply with quote
dap



Joined: 01 Dec 2007
Posts: 61
Location: Belgium
dap 12 Aug 2008, 17:42
New version:
  • Now supports 32 bits and 64 bits Unix
  • Fixed a bug in Windows version
  • Now the test code can modify any register
  • Removed the average and the median because I didn't use them


Download it here : http://ftp-developpez.com/dap/codetiming.zip

_________________
(French only) http://dap.developpez.com


Last edited by dap on 21 Dec 2008, 13:11; edited 1 time in total
Post 12 Aug 2008, 17:42
View user's profile Send private message Visit poster's website Reply with quote
wht36



Joined: 18 Sep 2005
Posts: 106
wht36 08 Dec 2008, 10:13
Very nice! I recently converted the counter macros from masm32 (I think by MichaelW) into fasm. The cycle counts are different when using your method compared to theirs. Not sure why, but maybe I'm not testing correctly.

Edit: Removed MichaelW's code. It's now in the rar file attachment below.


Last edited by wht36 on 04 Jan 2009, 04:07; edited 2 times in total
Post 08 Dec 2008, 10:13
View user's profile Send private message Reply with quote
wht36



Joined: 18 Sep 2005
Posts: 106
wht36 03 Jan 2009, 09:03
To dap: Very nice code. I took the liberty of adding your code to the bunch of timing routines I've collected. The attached file combines all the routines into one exe so that one can time a piece code with all the timing implementations and see what each timing algorithm returns.


Description: Bunch of timing routines, demonstrating various ways of using RDTSC to time a piece of code
Download
Filename: clock_counters4.rar
Filesize: 21.75 KB
Downloaded: 506 Time(s)

Post 03 Jan 2009, 09:03
View user's profile Send private message Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid 03 Jan 2009, 11:46
My experience is that it is best to let code run once before starting the time-counting, so that you know it is cached.
Post 03 Jan 2009, 11:46
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
DJ Mauretto



Joined: 14 Mar 2007
Posts: 464
Location: Rome,Italy
DJ Mauretto 03 Jan 2009, 13:27
Further this way is not totally exact,
in this way:
Code:
xor eax,eax
cpuid 
rdtssc
....your benchmark code
    
subsequent instructions may begin execution before the read operation is performed.
this is more optimal:
Code:
xor eax,eax
cpuid
rdtsc
xor eax,eax
cpuid

.... your benchmark code    


If you have a new Intel CPU then:
Code:
rdtscp
xor eax,eax
cpuid

... your benchmark code
    

_________________
Nil Volentibus Arduum Razz
Post 03 Jan 2009, 13:27
View user's profile Send private message Reply with quote
wht36



Joined: 18 Sep 2005
Posts: 106
wht36 03 Jan 2009, 16:17
The intel documentation recommends using running cpuid a few times and using cpuid before and after the rdtsc, as Bastien/dap's code have done. This is probably the most correct thing to do. However, some have said that cpuid does not affect code timing markedly and is quite costly. Perhaps that's why some implementations have left it out.

Other things which people say should be used include:
- SetProcessPriority to High or Realtime priority
- SetThreadAffinityMask to using one core
- Align code and data/have code/data at a fixed place in memory
- Save all registers so code tested always start with the same values
- Looping a few times and use the average/minimum of the timings

I would be very pleased if you could give me a profiling routine so that I can compare it with all the others practically and see what is the difference. Curiously while all the profiling routines I have compared so far are relatively consistent in their own way, the cycles they reported often differ from one another. I myself cannot tell which one is the most accurate, but I would be pleased if you can tell me which one you prefer (or none? or if you got your own routine?).

In any case I have added another routine, which has the cpuid before and after the code tested.


Description: Added another simple profiling routine (Simple2), which has an <xor eax,eax><cpuid> before and after code tested.
Download
Filename: clock_counters5.rar
Filesize: 21.94 KB
Downloaded: 500 Time(s)



Last edited by wht36 on 03 Jan 2009, 16:37; edited 1 time in total
Post 03 Jan 2009, 16:17
View user's profile Send private message Reply with quote
wht36



Joined: 18 Sep 2005
Posts: 106
wht36 03 Jan 2009, 16:33
vid wrote:
My experience is that it is best to let code run once before starting the time-counting, so that you know it is cached.

That is my thinking as well. Then again, if it's real life testing, should one not include the initial run as part of the timing as well? I have no answer to this question.
Post 03 Jan 2009, 16:33
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4156
Location: vpcmpistri
bitRAKE 03 Jan 2009, 16:43
Typically, cycle tests are performed on code that will loop sufficiently to over shadow the initial code cache misses. This makes the timing of code already in the L1 cache more useful, imho.
Post 03 Jan 2009, 16:43
View user's profile Send private message Visit poster's website Reply with quote
wht36



Joined: 18 Sep 2005
Posts: 106
wht36 03 Jan 2009, 16:56
Perhaps it doesn't matter then whether one include the initial timing or not. My brother has also commented that in the end even a few hundred cycles difference are not humanly noticeable, so why even care at all? He felt that one only needs to optimise code that will take more than a few seconds difference, so all this high precision rdtsc stuff is pretty arbitrary.
Post 03 Jan 2009, 16:56
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4156
Location: vpcmpistri
bitRAKE 03 Jan 2009, 17:10
Usually I only time the inner loop, so it does matter. I'm not timing many executions - which is not needed. Where the bar on optimization rests is a personal matter, imho. There is software we use every day millions of places all over the globe; or there is software waiting on other software. Human perception is not always the deciding factor, but even in that case it is difficult to know what others will do with your software.
Post 03 Jan 2009, 17:10
View user's profile Send private message Visit poster's website Reply with quote
wht36



Joined: 18 Sep 2005
Posts: 106
wht36 04 Jan 2009, 04:14
I agree with you whole heartedly. I should think even a few frames delay in an animation may be noticeable and communication software / search engines etc. should be as fast as possible. Although I suspect many programmers are more concerned about the speed at which they produce code rather than the speed at which their code runs Smile
Post 04 Jan 2009, 04:14
View user's profile Send private message Reply with quote
buzzkill



Joined: 15 Mar 2009
Posts: 111
Location: the nether lands
buzzkill 15 Mar 2009, 00:58
hi dap,

great program, i'm using it on linux 32-bit and it works like a charm Smile one question though: when running it several times, the overhead is nearly always the same, with just a few exceptions with a smaller value:
Code:
$ ./main 
 _____________________________________________________
|    # |     Cycles |     Overhead | Without overhead |
|------|------------|--------------|------------------|
|    1 |        693 |          639 |               54 | (discarded)
|    2 |        684 |          630 |               54 |
|    3 |        684 |          639 |               45 |
|    4 |        684 |          639 |               45 |
|    5 |        684 |          630 |               54 |
|    6 |        684 |          639 |               45 |
|    7 |        684 |          639 |               45 |
|    8 |        684 |          630 |               54 |
|    9 |        684 |          639 |               45 |
|   10 |        684 |          639 |               45 |
|______|____________|______________|__________________|

                       Minimum ticks    :        684
                     - Minimum overhead :        630
                     --------------------------------
                                                  54
$ ./main 
 _____________________________________________________
|    # |     Cycles |     Overhead | Without overhead |
|------|------------|--------------|------------------|
|    1 |        684 |          639 |               45 | (discarded)
|    2 |        684 |          639 |               45 |
|    3 |        693 |          639 |               54 |
|    4 |        684 |          639 |               45 |
|    5 |        684 |          639 |               45 |
|    6 |        693 |          639 |               54 |
|    7 |        684 |          639 |               45 |
|    8 |        684 |          639 |               45 |
|    9 |        693 |          639 |               54 |
|   10 |        684 |          639 |               45 |
|______|____________|______________|__________________|

                       Minimum ticks    :        684
                     - Minimum overhead :        639
                     --------------------------------
                                                  45
$ ./main 
 _____________________________________________________
|    # |     Cycles |     Overhead | Without overhead |
|------|------------|--------------|------------------|
|    1 |        693 |          639 |               54 | (discarded)
|    2 |        684 |          630 |               54 |
|    3 |        684 |          639 |               45 |
|    4 |        684 |          639 |               45 |
|    5 |        684 |          630 |               54 |
|    6 |        684 |          639 |               45 |
|    7 |        684 |          639 |               45 |
|    8 |        684 |          630 |               54 |
|    9 |        684 |          639 |               45 |
|   10 |        684 |          639 |               45 |
|______|____________|______________|__________________|

                       Minimum ticks    :        684
                     - Minimum overhead :        630
                     --------------------------------
                                                  54
$ ./main 
 _____________________________________________________
|    # |     Cycles |     Overhead | Without overhead |
|------|------------|--------------|------------------|
|    1 |        684 |          639 |               45 | (discarded)
|    2 |        693 |          639 |               54 |
|    3 |        684 |          639 |               45 |
|    4 |        684 |          639 |               45 |
|    5 |        693 |          639 |               54 |
|    6 |        684 |          639 |               45 |
|    7 |        684 |          639 |               45 |
|    8 |        693 |          639 |               54 |
|    9 |        684 |          639 |               45 |
|   10 |        684 |          639 |               45 |
|______|____________|______________|__________________|

                       Minimum ticks    :        684
                     - Minimum overhead :        639
                     --------------------------------
                                                  45
    

(i've just pasted 4 runs here, but it's the same with more runs).

so as you can see, 639 seems to be the "correct" overhead count for me, but because your program uses the minimum overhead to calculate the number of cycles, i get mixed results: 2 times 45 cycles and 2 times 54 cycles.

therefore, i've modified your program to determine the maximum overhead, and use that to calculate the number of cycles. that way, i get 4 times the same answer: 45 cycles (i won't post output to keep post length short Smile)

i don't know what could be causing this difference in overhead (it's always 630 or 639 btw, never anything else), but anyway, maybe there are people out there who get similar results to mine (my cpu is an intel e6600 btw) and who could get slightly more precise results with using max overhead, too.


one other question: how do you go about timing a piece of 'live code', ie. a piece of a larger program? you can't just strip away the rest of the program, because eg. when you access a register, and that has some random value, you get an invalid memory reference (and hence a segfault).
Post 15 Mar 2009, 00:58
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20516
Location: In your JS exploiting you and your system
revolution 15 Mar 2009, 01:07
When using RDTSC you can't really rely on small values. There are lots of things happening inside the CPU that can affect the readings.

Try to increase the loop counter to do a few more runs between RDTSC instructions. I haven't looked at the source so I am just assuming it has a loop counter. it not, add one.
Post 15 Mar 2009, 01:07
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.