flat assembler
Message board for the users of flat assembler.

How to count clock cycles?

mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 15 Apr 2007, 02:28
I know many people have talked about using this (http://www.agner.org/optimize/#testp) to calculate clock cycles, but it requires MASM.

Is there a more FASM-friendly way to do this? Do you just use MASM to assemble this code?

Walk me through calculating clock cycles of a tiny snippet of code, please.

_________________
My x86 Instruction Reference -- includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions.
Assembly Programmer's Journal
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 15 Apr 2007, 05:47
I'm not sure what you're asking about. Using RDTSC to get times is straightforward as long as you don't make a big deal out of it and interpret the results you get sensibly. I've got an example on my website, memcpy.asm, that uses the instruction to get times for moving 8000 bytes from one location to another (output in memcpy.txt). Results are returned in edx:eax, which is just right for a 64-bit integer result in 32-bit code. In 64-bit code you need to do

shl rdx, 32
or rax, rdx

to cast the time stamp as a 64-bit result.

So I wonder what you are trying to do... if you are trying to use his driver to access the PMCs, I found that I wanted to translate his C++ code to another language before I could make use of it. Afterwards I discovered that Intel processors don't seem to count the kind of data I wanted anyway, or at least if they do it's not documented in Intel's manual nor in Agner Fog's example. If you want to use PMCs on AMD, I think they may have a more readily usable driver. They may also count more useful events.

If instead you are trying to get a theoretical count for the number of clock cycles some code takes, this is really hard to do. A while back I was trying to do this for a chunk of code that was only about 40 or so clocks long and it took a couple of hours to do the walk-through by hand. If you're on an AMD processor, the "Pipeline View" in their tools package looks like it may be helpful. Intel's VTune is total junk for an assembly programmer, even though their processors are currently the most interesting for us because of the way they bumped up the throughput/latency ratio by a factor of 2 last summer -- no compiler is going to be able to handle this before obsolescence of the processors in question, so hand-written assembly is going to be the only way to fulfill the performance promise of the current crop of Intel processors. Probably will be similar with Barcelona this summer.

So please be a little bit more specific about what you are trying to do.
P.S. please let me know if you are having as many problems using my website as I am using yours!
mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 15 Apr 2007, 16:32
Xorpd! wrote:
If instead you are trying to get a theoretical count for the number of clock cycles some code takes, this is really hard to do.


Yes, I believe this is what I'm attempting to do. I always see people on these forums saying that if you replace X instruction you can reduce the clocks from 47 to 43 or something.

Xorpd! wrote:
P.S. please let me know if you are having as many problems using my website as I am using yours!


Could you elaborate? I'd love to fix whatever is causing you grief. Suggestions are welcome also.

zir_blazer



Joined: 05 Dec 2006
Posts: 66
zir_blazer 15 Apr 2007, 17:10
What he wants is a way to "benchmark" the code to find out how much time it takes, so that it's easier to work out how to optimize it.
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 15 Apr 2007, 22:27
Well, when someone says that a change in his code reduces the clocks from 47 to 43 I take it to mean that he has a test bed all set up for timing his code and he is then reporting the changes in measured clock cycle counts. However, it's hard to make a single run and get a reliable count of clock cycles like this when the snippet is so short. On the Pentium Classic, no problem, but current out-of-order processors are rather more problematic.

What it usually means is that the instruction sequence was placed in a loop and timed over several iterations, the loop overhead (counting the mispredicted branch at exit) and the latency of RDTSC were subtracted off, and then the "active" time was divided by the number of iterations. When you do this you don't typically get an integral number of clock cycles, so what is reported is most likely rounded to the nearest integer.
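
A rough sketch of that loop-timing procedure, in fasm syntax for 64-bit mode, might look like the following. It is not anyone's actual test bed: the names read_tsc, time_loops and ITERATIONS, the use of r8/r9 as scratch, and the imul standing in for the snippet under test are all assumptions made for illustration. Since an out-of-order core can overlap the snippet with the loop overhead, the subtraction is only an approximation.

Code:
use64

ITERATIONS = 1000000                    ; hypothetical iteration count

macro read_tsc {                        ; 64-bit TSC -> rax; clobbers rbx, rcx, rdx
        xor     eax, eax                ; call CPUID with a known leaf (0)
        cpuid                           ; serialize: earlier instructions complete first
        rdtsc                           ; edx:eax = time stamp counter
        shl     rdx, 32
        or      rax, rdx                ; combine into one 64-bit value
}

time_loops:                             ; returns rax = approx. TSC ticks per iteration
        push    rbx                     ; cpuid overwrites rbx

        ; pass 1: empty loop, captures loop + exit branch + RDTSC overhead
        read_tsc
        mov     r8, rax                 ; r8 = start time
        mov     ecx, ITERATIONS
.empty: dec     ecx
        jnz     .empty
        read_tsc
        sub     rax, r8
        mov     r9, rax                 ; r9 = overhead

        ; pass 2: same loop with the snippet under test inside
        read_tsc
        mov     r8, rax
        mov     ecx, ITERATIONS
.timed: imul    r10, r11                ; stand-in for the code being measured
        dec     ecx
        jnz     .timed
        read_tsc
        sub     rax, r8                 ; total time for pass 2
        sub     rax, r9                 ; "active" time (wraps if noise makes pass 2 shorter)

        xor     edx, edx
        mov     ecx, ITERATIONS
        div     rcx                     ; rax = quotient, rdx = remainder
        pop     rbx
        ret

The integer division truncates, so the remainder left in rdx is what would turn the result into the non-integral figure mentioned above.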

While it's necessary to be able to measure the speed of a code sequence like this if you want to strive for the best performance, one must also be able to trace through the code sequence to determine the points where resource conflicts cause delays in the critical path(s) that result in deviations from the time predicted by a more high-level view of the code sequence.

The high-level view will model the clock cycle count as determined by a single factor, such as throughput to one port or latency of a critical path. The low-level view is to create a Gantt chart of the instruction sequence so that you can see details such as an instruction which can potentially go to ports 0, 1, or 5 actually issuing to port 0 thereby increasing the number of instructions issued to that port, or port 0 sitting idle for a clock cycle because there is no instruction pending with operands ready, or an instruction in a critical sequence that can't issue on time because its port is blocked on that clock cycle by an earlier instruction that needs the same port.

With a Gantt chart for an instruction sequence in hand one can see conflicts like this and try to work around them. AFAIK there is no tool available for constructing such a "Pipeline View" for Intel processors, but it seems that AMD may have such a tool for their processors. Making these charts by hand starting with Intel's optimization docs and Agner Fog's docs (you have to use them both because they both have errors and oversights and are incomplete -- common sense and testing help you to determine what is the truth) is difficult for any but the smallest snippets of code.

Oh yes, the smallest snippets of code have another problem: for example, how many clocks does it take to execute

mulpd xmm0, xmm1

? One might say 1 clock because that's the throughput: you could issue a mulpd instruction every clock cycle if all the operands were ready. Another answer is 5 clock cycles because that's the latency: after a mulpd is issued you have to wait 5 clock cycles before you can use the result. My answer is "hopefully 0 clock cycles" because normally one has to issue more addition operations than multiplications so that there is plenty of room on port 0 because the primary throughput limitation is on port 1 and with effort we may be able to hide the latency of the multiplication somehow.
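
One way to see the difference between the "throughput" answer and the "latency" answer is to time two loops like the ones sketched below (fasm syntax, 64-bit mode, SSE2 assumed). The labels mul_latency, mul_throughput and ones, and the convention of passing the iteration count in ecx, are made up for this illustration. The operands are all 1.0 so the values never overflow or go denormal.

Code:
use64

mul_latency:                            ; ecx = iteration count (nonzero)
        movapd  xmm1, [ones]
        movapd  xmm0, xmm1
.next:  mulpd   xmm0, xmm1              ; each mulpd needs the previous result,
        mulpd   xmm0, xmm1              ; so the chain runs at the latency
        mulpd   xmm0, xmm1
        mulpd   xmm0, xmm1
        mulpd   xmm0, xmm1
        mulpd   xmm0, xmm1
        dec     ecx
        jnz     .next
        ret

mul_throughput:                         ; ecx = iteration count (nonzero)
        movapd  xmm6, [ones]
        movapd  xmm0, xmm6
        movapd  xmm1, xmm6
        movapd  xmm2, xmm6
        movapd  xmm3, xmm6
        movapd  xmm4, xmm6
        movapd  xmm5, xmm6
.next:  mulpd   xmm0, xmm6              ; six independent chains, so a new
        mulpd   xmm1, xmm6              ; mulpd can issue without waiting
        mulpd   xmm2, xmm6              ; for the previous one to finish
        mulpd   xmm3, xmm6
        mulpd   xmm4, xmm6
        mulpd   xmm5, xmm6
        dec     ecx
        jnz     .next
        ret

        align 16                        ; movapd needs a 16-byte aligned operand
ones    dq 1.0, 1.0                     ; (assumes the image is loaded at an aligned base)

If the latency is 5 and the throughput is 1 per clock as quoted above, one would expect something near 30 clocks per iteration for the first loop and near 6 for the second; the actual figures depend on the processor.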

I've been too busy to prepare any fun examples lately, sorry. The problem with your x86 instruction reference is that if I go to that page I see a box with 3 combo boxes and an edit control along with some ads. The three combo boxes start out at "All", "", and "Contains" respectively. If I type "movsd" (without the quotes) in the edit control some text appears below that says "1 instructions listed". If I change that to "movsq" it now says "0 instructions listed". Is that all that this reference is supposed to do, or does it print out more information on some browsers than others?
mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 15 Apr 2007, 22:55
Xorpd! wrote:
I've been too busy to prepare any fun examples lately, sorry. The problem with your x86 instruction reference is that if I go to that page I see a box with 3 combo boxes and an edit control along with some ads. The three combo boxes start out at "All", "", and "Contains" respectively. If I type "movsd" (without the quotes) in the edit control some text appears below that says "1 instructions listed". If I change that to "movsq" it now says "0 instructions listed". Is that all that this reference is supposed to do, or does it print out more information on some browsers than others?


I'm guessing you're using Internet Explorer.

The instruction reference works in Firefox, Opera, Konqueror, Safari, etc., but for whatever reason does not work in Internet Explorer. IE6 completely refuses to do anything useful with it. IE7 though will display the "X instructions listed." which is created with JavaScript but will not create the table of instructions. However, if you do print preview in IE7, the table is created successfully.

If there are any JavaScript/Ajax programmers willing to take a look and try to find out what in the world is going wrong with Internet Explorer I'd be very appreciative.

Btw, thanks for your very informative reply. So I guess you guys generally roll your own benchmark?

DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 16 Apr 2007, 00:25
Quote:
more FASM-friendly way to do this? Do you just use MASM to assemble this code?


The code won't assemble in FASM without a generous rewrite. :(

And the package seems to illegally redistribute TASM ... the author wrote interesting code but missed something: FASM (maybe NASM/YASM also). :(

Usage of the RDTSC instruction is simple; the tricky thing in the code (I looked at the "DOS" code) is the switch to 64-bit long mode from DOS ...

_________________
Bug Nr.: 12345

Title: Hello World program compiles to 100 KB !!!

Status: Closed: NOT a Bug
Hayden



Joined: 06 Oct 2005
Posts: 132
Hayden 16 Apr 2007, 04:16
I don't see the point of spending hours counting clock cycles as clocks for a given instruction are different from processor to processor.

I can think of better ways of optimizing code, like...

align loops/data/code to avoid cache-miss penalties
align far calls/jumps to 16 bytes to ensure the cache is loaded effectively, etc...

one gets the idea?
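
In fasm, alignment of this kind is requested with the align directive. The fragment below is only an illustration: sum_bytes and buffer are hypothetical names, and the 16-byte and 64-byte boundaries are assumptions about the target CPU's fetch and cache line sizes. Alignment is relative to where the code and data end up in memory, so the output format's section alignment has to be at least as large for it to take effect.

Code:
use64

sum_bytes:                              ; hypothetical: rsi -> buffer, ecx = length (nonzero)
        xor     eax, eax                ; eax = running sum
        align 16                        ; pad with NOPs so the loop starts on a 16-byte boundary
.next:  movzx   edx, byte [rsi]
        add     eax, edx
        inc     rsi
        dec     ecx
        jnz     .next
        ret

        align 64                        ; start the data on an assumed 64-byte cache line
buffer  rb 4096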

There are lots of other ways you can dramatically increase code speed before having to resort to clock counting.

_________________
New User.. Hayden McKay.
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8367
Location: Kraków, Poland
Tomasz Grysztar 16 Apr 2007, 07:02
madmatt



Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
madmatt 16 Apr 2007, 09:12
I've written some macros using some code from Agner Fog's example code. (At least, I'm pretty sure it was from Agner Fog.)
Code:
macro STARTTIMER  qvara {            ;qvara = 64-bit (qword) variable
      cpuid                          ;force all previous instructions to complete (clobbers eax, ebx, ecx, edx)
      rdtsc                          ;read time stamp counter into edx:eax
      mov     dword [qvara + 4], edx ;store high dword in GLOBAL variable
      mov     dword [qvara], eax     ;store low dword in GLOBAL variable
}

macro STOPTIMER qvara, qvarb {       ;qvara, qvarb = 64-bit (qword) variables
      cpuid                          ;force all timed instructions to complete (clobbers eax, ebx, ecx, edx)
      rdtsc                          ;read time stamp counter into edx:eax
      mov     dword [qvarb + 4], edx ;store high dword in second GLOBAL variable
      mov     dword [qvarb], eax     ;store low dword in second GLOBAL variable
      fild    [qvarb]                ;load 64-bit current value into FPU
      fild    [qvara]                ;load 64-bit start value into FPU
      fsubp                          ;st0 = current - start = CPU cycle count
      fstp    [qvara]                ;store the count as a double in first GLOBAL variable
}
    


You would use these macros like this, where qvara and qvarb are both 64-bit (qword) variables; after STOPTIMER, qvara holds the cycle count as a double float.
Code:
proc timingfunction
     local qvara:QWORD, qvarb:QWORD
     ;You should use an empty timer macro pair first to make sure all pipelined instructions are finished
     STARTTIMER qvara
     STOPTIMER qvara, qvarb

     STARTTIMER qvara ;qvara = start cycle count
     ;your code to be timed
     STOPTIMER qvara, qvarb ;qvara = cycle count, qvarb = end cycle count
     ;qvara = cycles elapsed
endp
    
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 16 Apr 2007, 11:20
Below is my variation on counting clocks.


Attachment: Profilers.rar (513 bytes)


_________________
Any offers?
mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 17 Apr 2007, 02:02
Thanks. There's some great information in both this and the thread linked by Tomasz.
nasm64developer



Joined: 11 Jul 2006
Posts: 4
nasm64developer 27 May 2007, 18:43
There are several issues with your timing code/macros.

You need to serialize both before and after the RDTSC.
You need to use CPUID with a known EAX input value.
You need to use MFENCE to fence all loads and stores.
You need to save the initial timing in a suitable register.
Or warm up the memory location into which you'll put it.

You won't get precise timing if the TSC is a constant TSC.
In that case your only alternative is to use the PMCs.

Last but not least, you need to worry about CPU affinity.
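
Several of these points can be sketched in fasm syntax for 64-bit mode as below. This is an illustration under assumptions, not a vetted timing harness: tsc_begin, tsc_end and measure are made-up names, r8 and r9 are arbitrarily chosen to hold the start time and the result, and CPU affinity is not handled at all because pinning a thread is OS-specific.

Code:
use64

macro tsc_begin {                       ; start time -> r8; clobbers rax, rbx, rcx, rdx
        mfence                          ; fence earlier loads and stores
        xor     eax, eax                ; known CPUID input value (leaf 0)
        cpuid                           ; serialize before reading the TSC
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        mov     r8, rax                 ; keep the start time in a register, not memory
        xor     eax, eax
        cpuid                           ; serialize after reading the TSC
}

macro tsc_end {                         ; elapsed ticks -> rax; clobbers rbx, rcx, rdx, r9
        mfence
        xor     eax, eax
        cpuid                           ; wait for the timed code to complete
        rdtsc
        shl     rdx, 32
        or      rax, rdx
        sub     rax, r8                 ; elapsed = end - start
        mov     r9, rax                 ; park the result; cpuid overwrites rax
        xor     eax, eax
        cpuid                           ; serialize after reading the TSC
        mov     rax, r9
}

measure:                                ; returns rax = elapsed ticks, timer overhead included
        push    rbx                     ; cpuid overwrites rbx
        tsc_begin
        ; code under test goes here (it must preserve r8)
        tsc_end
        pop     rbx
        ret

Running measure once with nothing between the two macros gives an estimate of the timer overhead to subtract, and calling it a few times first warms up the code in the instruction cache.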
nasm64developer



Joined: 11 Jul 2006
Posts: 4
nasm64developer 27 May 2007, 18:47
Of course you would also want to warm up the code in the L1i.

As well as align the timing overhead code -- this will of course
depend on your target CPU. Taking care of page or cache line
alignment is easy; aligning to a particular decoder is a bit more
tricky, but not impossible.

Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.