flat assembler
Message board for the users of flat assembler.
mattst88 15 Apr 2007, 02:28
I know many people have talked about using this (http://www.agner.org/optimize/#testp) to calculate clock cycles, but it requires MASM.
Is there a more FASM-friendly way to do this? Do you just use MASM to assemble this code? Walk me through calculating clock cycles of a tiny snippet of code, please.
_________________ My x86 Instruction Reference -- includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions. Assembly Programmer's Journal
mattst88 15 Apr 2007, 16:32
Xorpd! wrote: If instead you are trying to get a theoretical count for the number of clock cycles some code takes, this is really hard to do.
Yes, I believe this is what I'm attempting to do. I always see people on these forums saying that if you replace X instruction you can reduce the clocks from 47 to 43 or something.
Xorpd! wrote: P.S. please let me know if you are having as much problems using my website as I am using yours!
Could you elaborate? I'd love to fix whatever is causing you grief. Suggestions are welcome also.
zir_blazer 15 Apr 2007, 17:10
What he wants is a way to benchmark the code and find out how much time it takes, so that it is easier to work out how to optimize it.
Xorpd! 15 Apr 2007, 22:27
Well, when someone says that a change in his code reduces the clocks from 47 to 43, I take it to mean that he has a test bed all set up for timing his code and is reporting the change in measured clock cycle counts. However, it's hard to make a single run and get a reliable count of clock cycles like this when the snippet is so short. On the Pentium Classic, no problem, but current out-of-order processors are rather more problematic.
What it usually means is that the instruction sequence was placed in a loop and timed over several iterations, the loop overhead (counting the mispredicted branch at exit) and the latency of RDTSC subtracted off, and then the "active" time divided by the number of iterations. When you do this you don't typically get an integral number of clock cycles, so what is reported is most likely rounded to the nearest integer.
While it's necessary to be able to measure the speed of a code sequence like this if you want to strive for the best performance, one must also be able to trace through the code sequence to determine the points where resource conflicts cause delays in the critical path(s), which result in deviations from the time predicted by a more high-level view of the code sequence. The high-level view models the clock cycle count as determined by a single factor, such as throughput to one port or latency of a critical path.
The low-level view is to create a Gantt chart of the instruction sequence so that you can see details such as an instruction which can potentially go to ports 0, 1, or 5 actually issuing to port 0, thereby increasing the number of instructions issued to that port, or port 0 sitting idle for a clock cycle because there is no instruction pending with operands ready, or an instruction in a critical sequence that can't issue on time because its port is blocked on that clock cycle by an earlier instruction that needs the same port. With a Gantt chart for an instruction sequence in hand one can see conflicts like this and try to work around them. AFAIK there is no tool available for constructing such a "Pipeline View" for Intel processors, but it seems that AMD may have such a tool for their processors.
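The loop-timing arithmetic described at the start of this post can be sketched as follows. This is a minimal illustration, not anyone's actual test bed: the function name and all the numbers are hypothetical, and it assumes the fixed overhead was measured separately by running the same harness with an empty loop body.

```python
def cycles_per_iteration(total_cycles, overhead_cycles, iterations):
    """Average cycles per loop iteration after removing fixed overhead.

    total_cycles    -- counter difference measured around the whole loop
    overhead_cycles -- counter difference for the same harness with an empty
                       body (assumed to cover RDTSC latency, loop control and
                       the mispredicted exit branch)
    iterations      -- number of times the snippet ran
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return (total_cycles - overhead_cycles) / iterations


# Hypothetical measurements, not taken from any real CPU.
raw = cycles_per_iteration(total_cycles=43_350, overhead_cycles=150, iterations=1000)
print(raw)         # 43.2
print(round(raw))  # 43
```

The fractional result is why a reported figure such as "43 clocks" is most likely a rounded average rather than an exact count for a single pass.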
Making these charts by hand, starting with Intel's optimization docs and Agner Fog's docs (you have to use them both because they both have errors and oversights and are incomplete -- common sense and testing help you to determine what is the truth), is difficult for any but the smallest snippets of code.
Oh yes, the smallest snippets of code have another problem: for example, how many clocks does it take to execute mulpd xmm0, xmm1 ? One might say 1 clock because that's the throughput: you could issue a mulpd instruction every clock cycle if all the operands were ready. Another answer is 5 clock cycles because that's the latency: after a mulpd is issued you have to wait 5 clock cycles before you can use the result. My answer is "hopefully 0 clock cycles", because normally one has to issue more addition operations than multiplications, so there is plenty of room on port 0 (the primary throughput limitation is on port 1), and with effort we may be able to hide the latency of the multiplication somehow.
I've been too busy to prepare any fun examples lately, sorry. The problem with your x86 instruction reference is that if I go to that page I see a box with 3 combo boxes and an edit control along with some ads. The three combo boxes start out at "All", "", and "Contains" respectively. If I type "movsd" (without the quotes) in the edit control some text appears below that says "1 instructions listed". If I change that to "movsq" it now says "0 instructions listed". Is that all that this reference is supposed to do, or does it print out more information on some browsers than others?
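Going back to the mulpd question above, the gap between the "1 clock" and "5 clocks" answers shows up once several instructions are issued in a row. The following is a deliberately idealized sketch: the function names are hypothetical, and the model ignores port conflicts, decode limits and everything else a Gantt chart would reveal. It only uses the latency (5) and issue rate (one per clock) quoted in the post.

```python
def dependent_chain_cycles(n, latency):
    """Each instruction needs the previous result, so the latencies add up."""
    return n * latency


def independent_stream_cycles(n, latency, issue_interval=1):
    """Instructions issue back to back; only the last one's latency is exposed."""
    if n == 0:
        return 0
    return (n - 1) * issue_interval + latency


# Figures quoted in the post for mulpd: latency 5, one issue per clock.
print(dependent_chain_cycles(8, latency=5))     # 40
print(independent_stream_cycles(8, latency=5))  # 12
```

In this simple model eight dependent multiplies cost 5 clocks each, while eight independent ones cost 1.5 clocks each on average, and the average tends toward the throughput figure as the stream gets longer.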
mattst88 15 Apr 2007, 22:55
Xorpd! wrote: I've been too busy to prepare any fun examples lately, sorry. The problem with your x86 instruction reference is that if I go to that page I see a box with 3 combo boxes and an edit control along with some ads. The three combo boxes start out at "All", "", and "Contains" respectively. If I type "movsd" (without the quotes) in the edit control some text appears below that says "1 instructions listed". If I change that to "movsq" it now says "0 instructions listed". Is that all that this reference is supposed to do, or does it print out more information on some browsers than others?
I'm guessing you're using Internet Explorer. The instruction reference works in Firefox, Opera, Konqueror, Safari, etc., but for whatever reason does not work in Internet Explorer. IE6 completely refuses to do anything useful with it. IE7 will display the "X instructions listed." text, which is created with JavaScript, but will not create the table of instructions. However, if you do a print preview in IE7, the table is created successfully. If there are any JavaScript/Ajax programmers willing to take a look and try to find out what in the world is going wrong with Internet Explorer, I'd be very appreciative.
Btw, thanks for your very informative reply. So I guess you guys generally roll your own benchmark?
DOS386 16 Apr 2007, 00:25
Quote: more FASM-friendly way to do this? Do you just use MASM to assemble this code?
The code won't assemble in FASM without a generous rewrite. And the package seems to illegally redistribute TASM ... The author wrote interesting code but missed something: FASM (maybe NASM/YASM also).
Using the RDTSC instruction is simple; the tricky thing in the code (I looked at the "DOS" code) is the switch to 64-bit long mode from DOS ...
_________________ Bug Nr.: 12345 Title: Hello World program compiles to 100 KB !!! Status: Closed: NOT a Bug
Hayden 16 Apr 2007, 04:16
I don't see the point of spending hours counting clock cycles as clocks for a given instruction are different from processor to processor.
I can think of better ways of optimizing code, such as:
align loops/data/code to avoid cache-miss penalties
align far calls/jumps to 16 bytes to ensure the cache is loaded effectively
etc... one gets the idea? There are lots of other ways you can dramatically increase code speed before having to resort to clock counting.
_________________ New User.. Hayden McKay.
Tomasz Grysztar 16 Apr 2007, 07:02
madmatt 16 Apr 2007, 09:12
I've written some macros using some code from Agner Fog's example code. (At least, I'm pretty sure it was from Agner Fog.)
Code:
macro STARTTIMER qvara {
    ; qvara = 64-bit variable that receives the start count
    cpuid                           ; force all previous instructions to complete (clobbers eax, ebx, ecx, edx)
    rdtsc                           ; read time stamp counter into edx:eax
    mov     dword [qvara + 4], edx  ; store high dword in variable
    mov     dword [qvara], eax      ; store low dword in variable
}
macro STOPTIMER qvara, qvarb {
    ; qvara = start count on entry, qvarb = 64-bit variable that receives the end count
    cpuid                           ; force all timed instructions to complete
    rdtsc                           ; read time stamp counter into edx:eax
    mov     dword [qvarb + 4], edx  ; store high dword in second variable
    mov     dword [qvarb], eax      ; store low dword in second variable
    fild    [qvarb]                 ; load 64-bit current value into FPU
    fild    [qvara]                 ; load 64-bit start value into FPU
    fsubp                           ; end - start = CPU cycle count
    fstp    [qvara]                 ; store the result in the first variable as a double float
}
You would use these macros like this, where qvara and qvarb are both 64-bit (qword) variables; after STOPTIMER, qvara holds the elapsed count as a double float.
Code:
proc timingfunction
    local qvara:QWORD, qvarb:QWORD
    ; You should use an empty timer macro pair first to make sure all pipelined instructions are finished
    STARTTIMER qvara
    STOPTIMER qvara, qvarb
    STARTTIMER qvara            ; qvara = start cycle count
    ; your code to be timed
    STOPTIMER qvara, qvarb      ; qvarb = end cycle count, qvara = cycles elapsed
    ret
endp
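What the two dword stores and the subtraction in these macros compute can be shown with plain integers. This is only an illustrative model with made-up register values and hypothetical function names; the macros themselves do the subtraction on the FPU and store the result as a double.

```python
MASK64 = (1 << 64) - 1


def combine_tsc(edx, eax):
    """Build the 64-bit counter value from the two 32-bit halves RDTSC returns."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


def elapsed_cycles(start, end):
    """Difference of two 64-bit counter readings, wrapping modulo 2**64."""
    return (end - start) & MASK64


# Made-up register values for illustration.
start = combine_tsc(edx=0x00000012, eax=0xFFFFFF00)
end = combine_tsc(edx=0x00000013, eax=0x00000100)
print(hex(start))                  # 0x12ffffff00
print(elapsed_cycles(start, end))  # 512
```

The example picks values where the low dword wraps between the two readings, which is why both halves have to be stored and subtracted as one 64-bit quantity.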
asmfan 16 Apr 2007, 11:20
Below is my variation on counting clocks.
_________________ Any offers?
mattst88 17 Apr 2007, 02:02
Thanks. There's some great information in both this and the thread linked by Tomasz.
nasm64developer 27 May 2007, 18:43
There are several issues with your timing code/macros.
You need to serialize both before and after the RDTSC.
You need to use CPUID with a known EAX input value.
You need to use MFENCE to fence all loads and stores.
You need to save the initial timing in a suitable register, or warm up the memory location into which you'll put it.
You won't get precise cycle counts if the TSC is a constant TSC. In that case your only alternative is to use the PMCs.
Last but not least, you need to worry about CPU affinity.
nasm64developer 27 May 2007, 18:47
Of course you would also want to warm up the code in the L1 instruction cache (L1i), as well as align the timing overhead code -- this will of course depend on your target CPU. Taking care of page or cache line alignment is easy; aligning to a particular decoder is a bit more tricky, but not impossible.
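The warm-up and CPU affinity points from these two posts apply to any benchmark harness, not just one built on RDTSC. Below is a rough sketch under stated assumptions: it uses Python's `time.perf_counter_ns` as a stand-in for the time stamp counter, so it measures nanoseconds rather than clock cycles, the helper names are hypothetical, and `os.sched_setaffinity` only exists on some platforms (such as Linux), so the code falls back to doing nothing elsewhere.

```python
import os
import time


def pin_to_one_cpu():
    """Pin this process to one CPU; return the previous affinity set, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):  # e.g. not available on Windows or macOS
        return None
    try:
        previous = os.sched_getaffinity(0)
        os.sched_setaffinity(0, {min(previous)})
    except OSError:
        return None
    return previous


def restore_affinity(previous):
    """Undo pin_to_one_cpu(); does nothing if pinning was not possible."""
    if previous is not None:
        os.sched_setaffinity(0, previous)


def best_time_ns(func, warmups=3, runs=20):
    """Run func a few times to warm caches, then return the smallest elapsed time in ns."""
    for _ in range(warmups):
        func()
    best = None
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


previous = pin_to_one_cpu()
try:
    print(best_time_ns(lambda: sum(range(1000))), "ns (best of 20 runs)")
finally:
    restore_affinity(previous)
```

Taking the minimum over several runs is a common way to discard runs disturbed by interrupts or cold caches; it is a design choice of this sketch, not something the posts above prescribe.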
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.