flat assembler
Message board for the users of flat assembler.
![]() Goto page 1, 2 Next |
Author |
|
dap 04 May 2008, 14:13
Update : download latest version here.
Hello, Here is a small program which finds how many clocks a piece of code takes. If you use a processor which can change its frequency there is no guarantee that the results will be accurate, but I think they should be quite proportional to the real numbers. I wonder if I shouldn't discard the first count when computing the average. ![]() It will build a Windows executable but I only use two C functions so you should easily adapt it to anoter OS. Last edited by dap on 21 Dec 2008, 13:11; edited 3 times in total |
|||
![]() |
|
revolution 04 May 2008, 15:08
The RDTSC thing is very problematic. The later Intel CPU's now seem to count at a fixed rate even though the speed step thing is activated. The older CPU's will count at a different rate when the speed step is active.
I think you should consider using the lowest time. Computing the average tends to give results that include OS task switches etc. Unless, that is, you want to also include the OS overhead as part of the overall system speed. Last edited by revolution on 04 May 2008, 15:25; edited 1 time in total |
|||
![]() |
|
Madis731 04 May 2008, 15:23
Okay, the first is warmup so you shouldn't count it at all. The BTBs aren't warm and all caches are cold.
From the other loops you shouldn't take the average, but the 'median' which means the most probable. In ideal conditions you should take the minimum because you want to know the capabilities of the CPU, not the average. You can't optimize based on average ![]() When your processor changes its clock speed the RDTSC shows the same time except between the transitions maybe... Now there are more tiny things like REAL_TIME priority and ring0 execution... ...and finally - http://agner.org/optimize/ - some hints there |
|||
![]() |
|
dap 04 May 2008, 15:42
bitRAKE wrote: I hope you don't mind if I use that license in the future on my own productions. It is not mine - just to be clear : http://sam.zoy.org/wtfpl/ revolution wrote: The RDTSC thing is very problematic. The later Intel CPU's now seem to count at a fixed rate even though the speed step thing is activated. The older CPU's will count at a different rate when the speed stop is active. I've read a topic about this : http://www.canardplus.com/forums/showthread.php?t=22719 It's in French but maybe a Google translation will give you something interesting. The solution seems to be using the performance monitoring counters. I will do that once I will know how to write ring0 code on Windows. revolution wrote: I think you should consider using the lowest time. Computing the average tends to give results that include OS task switches etc. Unless, that is, you want to also include the OS overhead as part of the overall system speed. Madis731 wrote: From the other loops you shouldn't take the average, but the 'median' which means the most probable. In ideal conditions you should take the minimum because you want to know the capabilities of the CPU, not the average. You can't optimize based on average It's a good idea, I will add that in the results. Last edited by dap on 03 Jun 2008, 18:11; edited 1 time in total |
|||
![]() |
|
revolution 04 May 2008, 15:52
dap wrote: The solution seems to be using the performance monitoring counters. I will do that once I will know how to write ring0 code on Windows. |
|||
![]() |
|
dap 03 Jun 2008, 19:57
Here is a new version with minimums and medians. I have ran some tests and apparently the minimum method is the most efficient.
|
|||
![]() |
|
dap 12 Aug 2008, 17:42
New version:
Download it here : http://ftp-developpez.com/dap/codetiming.zip Last edited by dap on 21 Dec 2008, 13:11; edited 1 time in total |
|||
![]() |
|
wht36 08 Dec 2008, 10:13
Very nice! I recently converted the counter macros from masm32 (I think by MichaelW) into fasm. The cycle counts are different when using your method compared to theirs. Not sure why, but maybe I'm not testing correctly.
Edit: Removed MichaelW's code. It's now in the rar file attachment below. Last edited by wht36 on 04 Jan 2009, 04:07; edited 2 times in total |
|||
![]() |
|
wht36 03 Jan 2009, 09:03
To dap: Very nice code. I took the liberty of adding your code to the bunch of timing routines I've collected. The attached file combines all the routines into one exe so that one can time a piece code with all the timing implementations and see what each timing algorithm returns.
|
|||||||||||
![]() |
|
vid 03 Jan 2009, 11:46
My experience is that it is best to let code run once before starting the time-counting, so that you know it is cached.
|
|||
![]() |
|
DJ Mauretto 03 Jan 2009, 13:27
Further this way is not totally exact,
in this way: Code:
xor eax,eax
cpuid
rdtssc
....your benchmark code
this is more optimal: Code: xor eax,eax cpuid rdtsc xor eax,eax cpuid .... your benchmark code If you have a new Intel CPU then: Code:
rdtscp
xor eax,eax
cpuid
... your benchmark code
_________________ Nil Volentibus Arduum ![]() |
|||
![]() |
|
wht36 03 Jan 2009, 16:17
The intel documentation recommends using running cpuid a few times and using cpuid before and after the rdtsc, as Bastien/dap's code have done. This is probably the most correct thing to do. However, some have said that cpuid does not affect code timing markedly and is quite costly. Perhaps that's why some implementations have left it out.
Other things which people say should be used include: - SetProcessPriority to High or Realtime priority - SetThreadAffinityMask to using one core - Align code and data/have code/data at a fixed place in memory - Save all registers so code tested always start with the same values - Looping a few times and use the average/minimum of the timings I would be very pleased if you could give me a profiling routine so that I can compare it with all the others practically and see what is the difference. Curiously while all the profiling routines I have compared so far are relatively consistent in their own way, the cycles they reported often differ from one another. I myself cannot tell which one is the most accurate, but I would be pleased if you can tell me which one you prefer (or none? or if you got your own routine?). In any case I have added another routine, which has the cpuid before and after the code tested.
Last edited by wht36 on 03 Jan 2009, 16:37; edited 1 time in total |
|||||||||||
![]() |
|
wht36 03 Jan 2009, 16:33
vid wrote: My experience is that it is best to let code run once before starting the time-counting, so that you know it is cached. That is my thinking as well. Then again, if it's real life testing, should one not include the initial run as part of the timing as well? I have no answer to this question. |
|||
![]() |
|
bitRAKE 03 Jan 2009, 16:43
Typically, cycle tests are performed on code that will loop sufficiently to over shadow the initial code cache misses. This makes the timing of code already in the L1 cache more useful, imho.
|
|||
![]() |
|
wht36 03 Jan 2009, 16:56
Perhaps it doesn't matter then whether one include the initial timing or not. My brother has also commented that in the end even a few hundred cycles difference are not humanly noticeable, so why even care at all? He felt that one only needs to optimise code that will take more than a few seconds difference, so all this high precision rdtsc stuff is pretty arbitrary.
|
|||
![]() |
|
bitRAKE 03 Jan 2009, 17:10
Usually I only time the inner loop, so it does matter. I'm not timing many executions - which is not needed. Where the bar on optimization rests is a personal matter, imho. There is software we use every day millions of places all over the globe; or there is software waiting on other software. Human perception is not always the deciding factor, but even in that case it is difficult to know what others will do with your software.
|
|||
![]() |
|
wht36 04 Jan 2009, 04:14
I agree with you whole heartedly. I should think even a few frames delay in an animation may be noticeable and communication software / search engines etc. should be as fast as possible. Although I suspect many programmers are more concerned about the speed at which they produce code rather than the speed at which their code runs
![]() |
|||
![]() |
|
buzzkill 15 Mar 2009, 00:58
hi dap,
great program, i'm using it on linux 32-bit and it works like a charm ![]() Code: $ ./main _____________________________________________________ | # | Cycles | Overhead | Without overhead | |------|------------|--------------|------------------| | 1 | 693 | 639 | 54 | (discarded) | 2 | 684 | 630 | 54 | | 3 | 684 | 639 | 45 | | 4 | 684 | 639 | 45 | | 5 | 684 | 630 | 54 | | 6 | 684 | 639 | 45 | | 7 | 684 | 639 | 45 | | 8 | 684 | 630 | 54 | | 9 | 684 | 639 | 45 | | 10 | 684 | 639 | 45 | |______|____________|______________|__________________| Minimum ticks : 684 - Minimum overhead : 630 -------------------------------- 54 $ ./main _____________________________________________________ | # | Cycles | Overhead | Without overhead | |------|------------|--------------|------------------| | 1 | 684 | 639 | 45 | (discarded) | 2 | 684 | 639 | 45 | | 3 | 693 | 639 | 54 | | 4 | 684 | 639 | 45 | | 5 | 684 | 639 | 45 | | 6 | 693 | 639 | 54 | | 7 | 684 | 639 | 45 | | 8 | 684 | 639 | 45 | | 9 | 693 | 639 | 54 | | 10 | 684 | 639 | 45 | |______|____________|______________|__________________| Minimum ticks : 684 - Minimum overhead : 639 -------------------------------- 45 $ ./main _____________________________________________________ | # | Cycles | Overhead | Without overhead | |------|------------|--------------|------------------| | 1 | 693 | 639 | 54 | (discarded) | 2 | 684 | 630 | 54 | | 3 | 684 | 639 | 45 | | 4 | 684 | 639 | 45 | | 5 | 684 | 630 | 54 | | 6 | 684 | 639 | 45 | | 7 | 684 | 639 | 45 | | 8 | 684 | 630 | 54 | | 9 | 684 | 639 | 45 | | 10 | 684 | 639 | 45 | |______|____________|______________|__________________| Minimum ticks : 684 - Minimum overhead : 630 -------------------------------- 54 $ ./main _____________________________________________________ | # | Cycles | Overhead | Without overhead | |------|------------|--------------|------------------| | 1 | 684 | 639 | 45 | (discarded) | 2 | 693 | 639 | 54 | | 3 | 684 | 639 | 45 | | 4 | 684 | 639 | 45 | | 5 | 693 | 639 | 54 | | 6 | 684 | 639 | 45 | | 7 | 684 | 639 | 45 | | 8 | 693 | 639 | 54 | | 9 | 684 | 639 | 45 | | 10 | 684 | 639 | 45 | |______|____________|______________|__________________| Minimum ticks : 684 - Minimum overhead : 639 -------------------------------- 45 (i've just pasted 4 runs here, but it's the same with more runs). so as you can see, 639 seems to be the "correct" overhead count for me, but because your program uses the minimum overhead to calculate the number of cycles, i get mixed results: 2 times 45 cycles and 2 times 54 cycles. therefore, i've modified your program to determine the maximum overhead, and use that to calculate the number of cycles. that way, i get 4 times the same answer: 45 cycles (i won't post output to keep post length short ![]() i don't know what could be causing this difference in overhead (it's always 630 or 639 btw, never anything else), but anyway, maybe there are people out there who get similar results to mine (my cpu is an intel e6600 btw) and who could get slightly more precise results with using max overhead, too. one other question: how do you go about timing a piece of 'live code', ie. a piece of a larger program? you can't just strip away the rest of the program, because eg. when you access a register, and that has some random value, you get an invalid memory reference (and hence a segfault). |
|||
![]() |
|
revolution 15 Mar 2009, 01:07
When using RDTSC you can't really rely on small values. There are lots of things happening inside the CPU that can affect the readings.
Try to increase the loop counter to do a few more runs between RDTSC instructions. I haven't looked at the source so I am just assuming it has a loop counter. it not, add one. |
|||
![]() |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.