
PROFILE v2.0 (c) 2002 Fabio Bizzetti (a.k.a. Maverick).

Official download location: http://web.tiscali.it/bizzetti/coding/asm/x86/profile.zip
If you need to email me for some important reason: bizzetti@libero.it

DESCRIPTION: This is the new, official, improved version of PROFILE. It is freeware, but I still retain all the copyrights.
PROFILE is a tool that measures, with extreme precision and consistency, how many cycles it takes to execute a given subroutine on x86 CPUs (at least a Pentium is required). The cost of the RET is automatically subtracted, so that in practice you are profiling a piece of code rather than a subroutine.
Typical uses are comparative ones, i.e. testing optimization variants of a given routine and checking whether the code is *really* faster or not, instead of counting cycles on paper, which is no longer reliable with modern CPUs.

NOTE: YOU MUST READ THIS DOCUMENT CAREFULLY, BECAUSE I WILL NOT OFFER ANY SUPPORT IF YOU DID NOT FOLLOW ALL OF THE INSTRUCTIONS EXACTLY, WHETHER OR NOT YOU UNDERSTAND THE IMPORTANCE OF ALIGNMENT (OR OF ANY OTHER MENTIONED ISSUE).
IF YOU MODIFY THE ROUTINE IN ANY PART, I NO LONGER OFFER ANY SUPPORT, SO DO IT ENTIRELY AT YOUR OWN RISK. PART OF THE LICENCE OF USE STATES THAT IT IS FORBIDDEN TO PUBLICLY REPORT ANY PERFORMANCE PROBLEM OF PROFILE IF PROFILE WAS MODIFIED IN ANY PART, OR IF PROFILE'S USE REQUIREMENTS (E.G. ALIGNMENT) AREN'T FOLLOWED EXACTLY.

LICENCE OF USE:
You are allowed to use PROFILE as long as you don't make a direct profit from it (it is acceptable though that you make indirect profit from it, such as if it gives benefits to your own code's optimization work); as long as you don't claim you're the author of it; and as long as it wasn't modified in any part.
Redistribution must be done with all files in their perfectly original form, and completely free of charge (this means also no charge for disks, ftp or other transmission means).
Remember that this is freeware: use it at your own risk. Any trouble you may encounter, caused directly or indirectly by the use of PROFILE, does not entitle you to claim any damages or indemnity. If you do not agree with this policy, delete this and all other files related to PROFILE immediately.

NEW VERSION:
This new version of PROFILE is assembler-independent, FPU-aware, and has several other small improvements. For example, under a preemptive multitasking OS such as Windows 9x/NT, our process may be switched out in the middle of the profiling work, in which case the profiler would return a wrong value because of the OS interference.
Although this happens rarely, we have to take the possibility into account to offer as much reliability as possible, even without a human/intelligent interpretation of the results. The new PROFILEr automatically detects and fixes abnormal results/situations, so you can always be sure that what you get is what the CPU really spends on your test routine.

For example, this shows the consistency and precision of the new PROFILEr:


Test Code,           CPU:  Pentium     Athlon  
just a RET:                      0          0       cycles
 1 NOP + RET:                    1          1       cycles
 2 NOP + RET:                    1          1       cycles
 3 NOP + RET:                    2          1       cycles
 4 NOP + RET:                    2          2       cycles
 5 NOP + RET:                    3          2       cycles
 6 NOP + RET:                    3          2       cycles
 7 NOP + RET:                    4          3       cycles
 8 NOP + RET:                    4          3       cycles
 9 NOP + RET:                    5          3       cycles
10 NOP + RET:                    5          4       cycles
11 NOP + RET:                    6          4       cycles
12 NOP + RET:                    6          4       cycles
13 NOP + RET:                    7          5       cycles
14 NOP + RET:                    7          5       cycles
15 NOP + RET:                    8          5       cycles
16 NOP + RET:                    8          6       cycles
17 NOP + RET:                    9          6       cycles
18 NOP + RET:                    9          6       cycles
LEA R,[1]/L:DEC R/JNZ L/RET:     2          1       cycles   NOTE: we're using LEA instead of MOV here for branch alignment purposes on Athlons
LEA R,[2]/L:DEC R/JNZ L/RET:    10          2       cycles
LEA R,[3]/L:DEC R/JNZ L/RET:    11          3       cycles
LEA R,[4]/L:DEC R/JNZ L/RET:    16          4       cycles
LEA R,[5]/L:DEC R/JNZ L/RET:    21          5       cycles
LEA R,[6]/L:DEC R/JNZ L/RET:    26          6       cycles
LEA R,[7]/L:DEC R/JNZ L/RET:    31          7       cycles
LEA R,[8]/L:DEC R/JNZ L/RET:    36          8       cycles
LEA R,[9]/L:DEC R/JNZ L/RET:    41          9       cycles
LEA R,[10]/L:DEC R/JNZ L/RET:   46         24       cycles
LEA R,[11]/L:DEC R/JNZ L/RET:   51         25       cycles
LEA R,[12]/L:DEC R/JNZ L/RET:   56         26       cycles
LEA R,[13]/L:DEC R/JNZ L/RET:   61         27       cycles
LEA R,[14]/L:DEC R/JNZ L/RET:   66         28       cycles
LEA R,[15]/L:DEC R/JNZ L/RET:   71         29       cycles
LEA R,[16]/L:DEC R/JNZ L/RET:   76         30       cycles
LEA R,[17]/L:DEC R/JNZ L/RET:   81         31       cycles
LEA R,[18]/L:DEC R/JNZ L/RET:   86         32       cycles   NOTE: on Athlon then increases with steps of 2 cycles


Notice the branch prediction effects, especially the stall that happens on the Athlon at the 10th branch, in accordance with the theory expressed at the bottom of Agner Fog's article on branch prediction, which you can find at http://www.x86.org/articles/branch/branchprediction.htm
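
The rows of the table above happen to fit some very simple arithmetic. The Python sketch below only restates the listed numbers: the NOP rows match two NOPs per cycle in the Pentium column and three per cycle in the Athlon column, and the Athlon loop rows jump by 14 cycles at the 10th iteration. The function names are made up for illustration, and the formulas are a fit to the rows shown (0 to 18), not a description of how PROFILE or the CPUs work internally.

```python
def nop_cycles_pentium(n):
    """Cycles listed for n NOPs in the Pentium column: ceil(n / 2)."""
    return (n + 1) // 2


def nop_cycles_athlon(n):
    """Cycles listed for n NOPs in the Athlon column: ceil(n / 3)."""
    return (n + 2) // 3


def loop_cycles_athlon(n):
    """Cycles listed for the LEA/DEC/JNZ loop with n iterations (Athlon).

    Fits rows 1..18 only: n cycles up to 9 iterations, n + 14 from the 10th.
    """
    return n if n <= 9 else n + 14


# Check the fit against a few rows copied from the table.
for n, pentium, athlon in [(1, 1, 1), (3, 2, 1), (7, 4, 3), (18, 9, 6)]:
    assert nop_cycles_pentium(n) == pentium
    assert nop_cycles_athlon(n) == athlon

print(loop_cycles_athlon(9), loop_cycles_athlon(10))  # prints "9 24"
```

The jump from 9 to 24 is the stall at the 10th branch mentioned above.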

Consider, though, that modern CPUs have extremely complex branch prediction hardware, and that getting a stable result involving branches is extremely difficult.
I have done my best to make PROFILE give the most stable, precise and consistent results, given the following handicaps:

1) The very little time I can dedicate to the development of this freeware tool.
2) PROFILE must work on all CPUs, and must tune itself to each of them without any CPU-specific code.
3) PROFILE must run in ring 3, thus losing all the important benefits that would derive from running it in ring 0.
4) As with the point above, PROFILE has the further complication that it must preserve the contents of EAX,EBX,ECX,EDX,ESI,EDI,EBP,ESP, and even of EFLAGS, which must be modified only by the routine being profiled, and not by PROFILE (this unfortunately prevents some neat serializing CPUID solutions).
5) Fighting with the modern and very complex branch prediction mechanisms of e.g. the P-IV and of the Athlon.
6) Assembler-independence: because some assemblers (e.g. MASM) don't always translate instructions 1:1, and/or offer little or no control over short vs long forms of instructions or over code/data alignment, this new version of PROFILE has been released as machine code, with instructions (that you must follow completely) on how to use it.

Considering all these problems, the results should look impressively precise.

Now I'm gonna show the instructions to use PROFILE (NOTE: YOU MUST FOLLOW THEM EXACTLY):

Setup:
1) ALLOCATE 8 KB WITH 4 KB ALIGNMENT, AND CLEAR IT (using VirtualAlloc to commit two consecutive 4 KB pages is one way to do it. If you aren't totally certain about what you're doing, stick to VirtualAlloc, because it ensures the REQUIRED alignment).
2) LOAD THERE FROM DISK (OR FROM MEMORY) THE CONTENTS OF THE FILE "PROFILE.BIN" THAT YOU FOUND IN THIS ARCHIVE.
3) CALL the CODE at offset 559 from the start of the above block. This makes sure all internal relocations are fixed up, after which PROFILE is ready for use. Call it just once. Did I say that the offset 559 is a hex value? No, in fact it isn't. It's decimal.
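
The layout described in the three setup steps can be modelled as plain address arithmetic. The Python sketch below does not allocate executable memory or run PROFILE (that has to be done from native code, e.g. with VirtualAlloc); it only shows a cleared 8 KB block with the file contents copied in, and where the one-time initialization entry point ends up. The helper names are hypothetical, and it assumes that the file is copied to offset 0 of the block and fits within the 8 KB.

```python
PAGE = 4096
IMAGE_SIZE = 2 * PAGE   # the 8 KB block from step 1
INIT_OFFSET = 559       # decimal, from step 3


def build_image(code):
    """Return a cleared 8 KB buffer with the file contents at offset 0.

    Assumption: the file is loaded at the very start of the block.
    """
    if len(code) > IMAGE_SIZE:
        raise ValueError("file does not fit in the 8 KB block")
    image = bytearray(IMAGE_SIZE)   # a new bytearray is zero-filled
    image[:len(code)] = code
    return image


def init_address(base):
    """Address to CALL once after loading, given the block's base address."""
    if base % PAGE != 0:
        raise ValueError("base address is not 4 KB aligned")
    return base + INIT_OFFSET


# Example with a made-up, page-aligned base address.
print(hex(init_address(0x00400000)))  # prints "0x40022f"
```

Note that 559 decimal is 22F hex, which is an easy way to double-check the address in a debugger.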

After Setup, Use PROFILE this way:
[Optional] Set up EAX,EBX,ECX,EDX,ESI,EDI,EBP and the flags as you wish; PUSH any parameters of the routine to be tested on the stack, as if you were going to call the routine to be tested directly.
[MustDoIt] PUSH the address of the routine to be tested, then make a CALL inside PROFILE at offset 1.
PROFILE will pass the stack and the CPU registers and flags as if you had called the routine to be tested directly. On return, a 64 bit unsigned integer at offset 4096 (which you can also access as 32 bit, if you know that it didn't overflow) will contain the exact number of cycles the tested routine took to execute, and the stack and CPU registers/flags will reflect the modifications made by the routine to be tested. If it made none, they will be preserved.
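
To make the 64 bit versus 32 bit access concrete, here is a small Python sketch that reads the counter out of a snapshot of the 8 KB block. It is only a model of the memory layout described above: the function name is hypothetical, the snapshot is an ordinary byte buffer rather than live PROFILE memory, and the little-endian format strings follow from x86 byte order.

```python
import struct

CYCLES_OFFSET = 4096   # where PROFILE leaves the result


def read_cycles(image):
    """Return (64-bit count, low 32 bits) from a snapshot of the 8 KB block."""
    (cycles64,) = struct.unpack_from("<Q", image, CYCLES_OFFSET)
    (cycles32,) = struct.unpack_from("<I", image, CYCLES_OFFSET)
    return cycles64, cycles32


# Fake snapshot: pretend PROFILE stored 86 cycles.
snapshot = bytearray(8192)
struct.pack_into("<Q", snapshot, CYCLES_OFFSET, 86)
print(read_cycles(snapshot))  # prints "(86, 86)"

# Once the count reaches 2**32, the 32-bit view no longer matches.
struct.pack_into("<Q", snapshot, CYCLES_OFFSET, 2**32 + 5)
print(read_cycles(snapshot))  # prints "(4294967301, 5)"
```

The 32 bit read is just the low dword of the same value, so it is safe only while the count stays below 2**32.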

NOTE: THE ROUTINE TO BE TESTED GETS CALLED BY PROFILE SEVERAL TIMES, SO YOU MUST MAKE SURE THAT IT BEHAVES EXACTLY THE SAME WAY EVERY TIME IT IS CALLED. This means that if the routine to be tested writes/updates and also reads/uses pointers in some memory location (including stack parameters), you have to reset them at the beginning of the routine to be tested itself.

NOTE: YOU MUST ALSO ALIGN TO A 4 KB PAGE every routine you want to test (use VirtualAlloc() to be sure you get 4 KB alignment, if you don't know how to provide it yourself).
The comparative results will be reliable only if each routine to be tested starts at the beginning of its own 4 KB page.
DO NOT TEST ANYTHING THAT ISN'T 4096 BYTES ALIGNED.

Failure to follow all of the above instructions with total precision may cause more or less subtle malfunctions that may go unnoticed at first, and that in certain situations give you wrong numbers that you will likely trust. Read the LICENCE at the beginning of this document carefully if you forgot what this implies.

A FINAL NOTE:
Some of you may wonder why PROFILE requires a subroutine to test/profile, rather than e.g. inlined code.
There are many reasons that make the former approach (the one used) much more precise and reliable, especially on modern CPUs with advanced branch prediction, trace cache, etc.
Just as an example, consider the code cache issues alone. Even if we allow the inlined code under test to be aligned to a cache line (let's take the Pentium cache line size as an example: 32 bytes), a piece of code 33 bytes long will take *a lot* longer to execute than a 32-byte one, giving misleading results: misleading because in the real world, where the code is not guaranteed to be aligned at all, the 33-byte one may not be any slower than the 32-byte one (it may even be faster). Even worse, on out-of-order capable CPUs (e.g. P-Pro, K7, etc.) things will depend on even more internal factors, which on the first run will be very unpredictable. Branch prediction will make results even less consistent. Not to mention that our 33-byte routine may or may not be already cached, entirely or in part, causing one or two cache line loads from memory (which take a non-constant amount of time) and making the result highly unpredictable. The ~randomness associated with such a profiling technique is too high to make it really useful.
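
The 32-byte versus 33-byte comparison above comes down to how many cache lines a piece of code spans. The Python sketch below counts them for a given start address and length, using the 32-byte Pentium line size from the example. The function name is made up for illustration, and this counts lines only: it says nothing about how long each line load takes.

```python
def cache_lines_touched(start, length, line_size=32):
    """Number of cache lines covered by `length` bytes of code at `start`."""
    first = start // line_size
    last = (start + length - 1) // line_size
    return last - first + 1


# Aligned to a cache line: the extra byte costs a whole extra line.
print(cache_lines_touched(0, 32), cache_lines_touched(0, 33))  # prints "1 2"

# Not aligned (start address 1): both versions span two lines.
print(cache_lines_touched(1, 32), cache_lines_touched(1, 33))  # prints "2 2"
```

This is why a difference measured on aligned inlined code may vanish once the same code sits at an arbitrary address.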

I think that the only reliable way to profile a piece of code, and mostly for comparative aims (i.e. to find/tune the best optimization of a certain routine for a certain CPU) is the one that profiles a subroutine, as used in PROFILE.

However, given the huge limitations and problems I mentioned before, here's what I feel may be the best implementation of an inlined PRE/POST version of PROFILE (but use it at your own "risk"):

        CPUID                           ; CPUID has different execution times the first two times it gets executed.
        MOV     EAX,[.CYCLES]           ; let's warm at least that part of the cache
        FWAIT
        CPUID
        RDTSC
        MOV     [.CYCLES],EAX
        CPUID

.. your code to be tested goes here ..
        
        FWAIT
        CPUID
        RDTSC
        XCHG    EAX,[.CYCLES]
        SUB     [.CYCLES],EAX

.. the (32 bit) number of cycles it took to execute your code to be tested is now in [.CYCLES] (after the XCHG, EAX holds the starting timestamp) .. remember though that it still has, although reduced, quite a high degree of randomness, so use PROFILE on a subroutine every time you can.
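
The XCHG/SUB pair above subtracts the low 32 bits of the first RDTSC reading from the low 32 bits of the second one. The Python sketch below models that subtraction, including the case where the low dword of the counter wraps around between the two readings. The function name is hypothetical, and the result is meaningful only if fewer than 2**32 cycles elapsed between the two RDTSC instructions.

```python
MASK32 = 0xFFFFFFFF


def elapsed_cycles(start_low, end_low):
    """32-bit difference of two RDTSC low dwords, as SUB would compute it."""
    return (end_low - start_low) & MASK32


# Ordinary case: no wraparound between the two readings.
print(elapsed_cycles(1000, 1086))            # prints "86"

# The low dword wrapped past 0xFFFFFFFF; the difference is still correct.
print(elapsed_cycles(0xFFFFFFF0, 0x10))      # prints "32"
```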

That's all folks.
