flat assembler
Message board for the users of flat assembler.
A more elegant way of aligning procedures
revolution 17 Dec 2012, 11:28
Maybe:
Code:
times ((myproc - myproc.critical_loop) and 0xf) nop
proc myproc
    ;...
Tomasz Grysztar 17 Dec 2012, 11:31
There was a macro for this purpose posted here: http://board.flatassembler.net/topic.php?p=141756#141756
nmake 17 Dec 2012, 15:26
I got the same non-solvable problem using the times directive. I will look into that macro. If I put that macro in my code, will it override the original align macro?
AsmGuru62 17 Dec 2012, 15:57
I always wondered: how come FASM is so blazingly fast, yet its code never follows these alignment guidelines? The loops are not aligned there, or am I missing something?
revolution 17 Dec 2012, 16:01
AsmGuru62: Yeah. Actually, loop alignment is usually only going to have a very minor effect on execution speed. And things like this make me wonder whether the more important algorithmic and cache-hit improvements have been properly applied before going into nitty-gritty minor things like this alignment problem.
nmake 17 Dec 2012, 16:10
Alignment is that little secret thing that doesn't seem to do any good to any program, barely noticeable, but if you overlook it, it can do damage behind your back. It is a habit that needs to be dealt with, without having expectations that it will do miracles for you.
More important is cache behavior, reducing loop overhead, loop unrolling, and spreading micro-operations evenly across execution ports, if possible at the retirement rate of your processor. Using the right execution units for the right job at the right time. Many C++ coders will tell you algorithms are the most important thing of all, but if you implement an algorithm with bad code it can perform very badly. I could give you an example: GNU sort uses a sophisticated algorithm of thousands of lines of code. I skipped that algorithm and produced my own sort algorithm using brute force, and it performed over a thousand percent better than GNU sort, even though I didn't use a good algorithm, merely straightforward brute force.
nmake 17 Dec 2012, 16:23
AsmGuru62 wrote: I always wondered: how come FASM is so blazingly fast,
In this particular case alignment is all about instruction prefetching, to me at least. But it is important when coding SIMD and also important for cache behavior. A misaligned piece of code that is above 64 bytes can have a huge penalty in a critical loop. You also waste cache lines if you use 3 sets of 64 bytes for code that could have fit in 2 sets of 64-byte lines, which means one third of the cache lines you occupy are wasted.
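The cache-line arithmetic above can be checked with a small sketch. It assumes 64-byte lines, as in the post; `lines_touched` and the sample addresses are hypothetical, chosen only for illustration.

```python
CACHE_LINE = 64  # bytes per cache line, as assumed in the post


def lines_touched(start: int, size: int, line: int = CACHE_LINE) -> int:
    """Number of cache lines covered by `size` bytes of code starting at `start`."""
    if size <= 0:
        return 0
    first = start // line              # index of the line holding the first byte
    last = (start + size - 1) // line  # index of the line holding the last byte
    return last - first + 1


# A 128-byte loop that starts on a line boundary (hypothetical address)...
aligned = lines_touched(0x401000, 128)
# ...and the same loop shifted forward by 32 bytes.
misaligned = lines_touched(0x401020, 128)
print(aligned, misaligned)
```

The shifted copy occupies one more line than the aligned one, even though the code itself is the same size.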
revolution 17 Dec 2012, 16:25
nmake wrote: I got the same non-solvable problem using the times directive
Code:
align 16
times ((myproc - myproc.critical_loop) and 0xf) nop
proc myproc
    ;...
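To see why the `times` expression works once the preceding `align 16` is in place, here is a sketch of the same arithmetic in Python. The function name is hypothetical, the 17-byte offset is taken from the `fib` listing posted later in this thread (0x401070 - 0x40105F), and the start address is just an assumed 16-byte-aligned value.

```python
def times_padding(loop_offset: int, alignment: int = 16) -> int:
    """Mirror of ((myproc - myproc.critical_loop) and 0xf) for a power-of-two alignment.

    loop_offset is myproc.critical_loop - myproc, so the fasm expression is
    (-loop_offset) masked with alignment - 1.
    """
    return (-loop_offset) & (alignment - 1)


aligned_start = 0x401060  # where `align 16` leaves the address (assumed value)
loop_offset = 0x11        # bytes from the proc entry to the loop label

pad = times_padding(loop_offset)
proc_entry = aligned_start + pad       # the proc starts right after the NOPs
loop_addr = proc_entry + loop_offset   # final address of the loop label

print(pad, hex(loop_addr), loop_addr % 16)
```

Note that this scheme can emit up to 15 NOPs on top of whatever `align 16` already added, so it may pad more than is strictly needed to align the loop.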
Xorpd! 17 Dec 2012, 16:56
I have noticed that, besides aligning a branch target on a 16-byte boundary, ensuring that the next 16-byte boundary is also an instruction boundary seems to help performance on my old Core 2 Duo. Maybe it's just a superstition based on limited experience, but it seems to make a small but perceptible difference.
Aligning with nops doesn't seem to be a good idea in this case because those extra instructions will then have to be decoded in the inner loop. There are some instructions such as movdqa that have equivalent forms with different lengths (although the form that moves the instruction to a different port is unique), and in pipelined code it may also be possible to permute the order of some instructions. I can't see how an assembler can carry out these kinds of optimizations unassisted, because the programmer may have wanted to use the specific forms of instructions coded to watermark his code. Of course FASM doesn't have full support for watermarking anyhow, so you probably would be using another assembler if you wanted to do that. To check alignment I use the even more crude method of examining the assembler output with dumpbin.
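The "next 16-byte boundary is also an instruction boundary" condition is easy to check mechanically once you have the instruction lengths from a listing. A minimal sketch, assuming the loop itself starts on a 16-byte boundary; the function names and both length lists are made up for illustration and do not come from real code.

```python
from itertools import accumulate


def instruction_boundaries(lengths):
    """Offsets from the loop start at which an instruction begins (or the code ends)."""
    return {0, *accumulate(lengths)}


def boundary_at(lengths, offset=16):
    """True if `offset` bytes after the loop start falls exactly between two instructions."""
    return offset in instruction_boundaries(lengths)


# Hypothetical instruction lengths in bytes, in program order.
good = [4, 4, 5, 3, 2]  # boundaries at 0, 4, 8, 13, 16, 18
bad = [3, 5, 6, 4]      # boundaries at 0, 3, 8, 14, 18: one instruction straddles 16

print(boundary_at(good), boundary_at(bad))
```

When the check fails, picking a longer or shorter equivalent encoding for one of the earlier instructions, as described above, is one way to move the boundary without adding nops.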
LocoDelAssembly 17 Dec 2012, 17:17
Have any of you taken a moment to look at Tomasz's post? This seems to work just fine:
Code:
include 'win32axp.inc'

macro align value,addr=$
{
  local base,size
  if addr>$
    base = addr-size
    size = ((base+value-1)/value*value-base)
    db size dup 90h
  else
    db ((addr+value-1)/value*value-addr) dup 90h
  end if
}

proc start
  local buff[256]:BYTE

  stdcall fib, 6
  cinvoke wsprintf, addr buff, <"Result: %u", 13, 10>, eax
  invoke MessageBox, 0, addr buff, "Align test", 0
  invoke ExitProcess, 0
endp

if used fib
  align 16, fib.loop
end if
proc fib, n
  mov ecx, [n]
  mov eax, 1
  xor edx, edx
  test ecx, ecx
  jz .exit

.loop:
  xadd eax, edx
  dec ecx
  jnz .loop

.exit:
  ret
endp

.end start

Code:
CPU Disasm
Address   Hex dump           Command                                  Comments
00401000  /.  55             PUSH EBP
00401001  |.  89E5           MOV EBP,ESP
00401003  |.  81EC 00010000  SUB ESP,100
00401009  |.  6A 06          PUSH 6                                   ; /Arg1 = 6
0040100B  |.  E8 4F000000    CALL 0040105F                            ; \fiuwiue.0040105F
00401010  |.  50             PUSH EAX                                 ; /<%u>
00401011  |.  E8 0D000000    CALL 00401023                            ; |Format = "", jump over immediate data
00401016  |.  52 65 73 75 6  ASCII "Result: %u ",0                    ; |ASCII "Result: %u "
00401023  |>  8D95 00FFFFFF  LEA EDX,[LOCAL.64]                       ; |
00401029  |.  52             PUSH EDX                                 ; |Buf => OFFSET LOCAL.64
0040102A  |.  FF15 88204000  CALL DWORD PTR DS:[<&USER32.wsprintfA>]  ; \USER32.wsprintfA
00401030  |.  83C4 0C        ADD ESP,0C
00401033  |.  6A 00          PUSH 0                                   ; /Type = MB_OK|MB_DEFBUTTON1|MB_APPLMODAL
00401035  |.  E8 0B000000    CALL 00401045                            ; |Caption => "Align test", jump over immediate data
0040103A  |.  41 6C 69 67 6  ASCII "Align test",0                     ; |ASCII "Align test"
00401045  |>  8D95 00FFFFFF  LEA EDX,[LOCAL.64]                       ; |
0040104B  |.  52             PUSH EDX                                 ; |Text => OFFSET LOCAL.64
0040104C  |.  6A 00          PUSH 0                                   ; |hOwner = NULL
0040104E  |.  FF15 84204000  CALL DWORD PTR DS:[<&USER32.MessageBoxA> ; \USER32.MessageBoxA
00401054  |.  6A 00          PUSH 0                                   ; /ExitCode = 0
00401056  |.  FF15 60204000  CALL DWORD PTR DS:[<&KERNEL32.ExitProces ; \KERNEL32.ExitProcess
0040105C  |.  90             NOP                                      ; Nops added by Tomasz's align macro
0040105D  |.  90             NOP
0040105E  |.  90             NOP
0040105F  |$  55             PUSH EBP                                 ; fiuwiue.0040105F(guessed Arg1)
00401060  |.  89E5           MOV EBP,ESP
00401062  |.  8B4D 08        MOV ECX,DWORD PTR SS:[EBP+8]
00401065  |.  B8 01000000    MOV EAX,1
0040106A  |.  31D2           XOR EDX,EDX
0040106C  |.  85C9           TEST ECX,ECX
0040106E  |.  74 06          JE SHORT 00401076
00401070  |>  0FC1D0         /XADD EAX,EDX                            ; Loop is aligned
00401073  |.  49             |DEC ECX
00401074  |.^ 75 FA          \JNE SHORT 00401070
00401076  |>  C9             LEAVE
00401077  \.  C2 0400        RETN 4
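As a sanity check on the dump, the macro's rounding can be replayed by hand. The sketch below redoes the `(base+value-1)/value*value-base` computation in Python for the forward-reference case; `nop_count` is a hypothetical name, and the addresses are read off the disassembly above. It only mirrors the arithmetic, not fasm's multi-pass resolution of `size`.

```python
def nop_count(unpadded_addr: int, value: int) -> int:
    """NOPs needed so that a label at `unpadded_addr` lands on a multiple of `value`.

    Same rounding as the macro: (base+value-1)/value*value - base.
    """
    return (unpadded_addr + value - 1) // value * value - unpadded_addr


padding_site = 0x40105C  # first byte after the ExitProcess call in the dump
loop_offset = 0x11       # fib.loop - fib, i.e. 0x401070 - 0x40105F
base = padding_site + loop_offset  # where fib.loop would sit with no padding

size = nop_count(base, 16)
print(size, hex(base + size))
```

This agrees with the three NOPs at 0040105C..0040105E and with the loop ending up at 00401070.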
nmake 17 Dec 2012, 17:24
I did look at it and will try it very soon.
What version of OllyDbg do you use, btw?
LocoDelAssembly 17 Dec 2012, 17:45
OllyDbg v2.01 (alpha 4), which I believe is seriously outdated now.
revolution 18 Dec 2012, 09:09
nmake wrote: Many C++ coders will tell you algorithms are the most important thing of all, but if you implement an algorithm with bad code it can perform very badly.
nmake wrote: I could give you an example: GNU sort uses a sophisticated algorithm of thousands of lines of code. I skipped that algorithm and produced my own sort algorithm using brute force, and it performed over a thousand percent better than GNU sort, even though I didn't use a good algorithm, merely straightforward brute force.
In summary: Choosing the right algorithm for the task is what is important. And the right algorithm may not always be the one with the lowest O() value. One should not expect to get fantastic results when using a generic library function in critical code.
nmake 18 Dec 2012, 10:07
Library or no library, there is not much you can do wrong in a sorting algorithm: load from disk, split files apart, sort parts, merge parts and save to disk again. Sorting bits of data in memory, there is not much you can do wrong here, unless you do something VERY wrong, and in this case GNU sort has done something very wrong. No matter how badly you code something, it should still perform pretty decently when dealing with data in memory, but it doesn't with GNU sort. They messed up: they thought a good algorithm would solve it, and instead they messed it up by using bad code.
Complex algorithms in the hands of people who believe in the word "algorithm" may or may not work; it depends on luck. But with simplicity of mind and a little bit of hardware understanding, you can do wonders without the same algorithm, and you can do even better with a good algorithm. An algorithm is only good if it is technically well suited to the hardware. There are people who believe all algorithms are fit for hardware, but it matters how you code it. Before using an algorithm you have to weigh it. Does it go back and forth in a zig-zag order (possibly creating bandwidth problems or cache problems)? Does the algorithm leave room to do other tasks in between different sections of the algorithm? Does this effect outweigh the effect of using the algorithm? Does the algorithm rely on a specific technological feature of the computer or not? There are many things to consider, and sometimes it simply pays off to drop everything in the junkyard and settle for pure simplicity.
nmake 18 Dec 2012, 10:22
LocoDelAssembly wrote: OllyDbg v2.01 (alpha 4), which I believe is seriously outdated now.
I downloaded Immunity Debugger today; ironically it needs the Python runtime (I just made a thread on this forum about Python). It is amazingly like OllyDbg, it even uses the same bitmaps on buttons, and in addition it has a graphical view just like IDA Pro has.
DOS386 20 Dec 2012, 02:00
AsmGuru62 wrote: I always wondered: how come FASM is so blazingly fast,
Most likely people overestimate the benefits of alignment. If you get a 5% speedup while measuring an isolated loop, you will hardly get a 5% speedup of your complete program. Excessive aligning of everything increases bloat, and thus it also increases cache and page misses, as the CPU walks through all those functions distributed around your app.
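The point about isolated-loop measurements can be put into numbers with Amdahl's law. A rough sketch follows; the 20% share of run time is a made-up figure for illustration, not a measurement of any real program, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of run time gets `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical profile: the aligned loop accounts for 20% of total run time
# and becomes 5% faster when measured in isolation.
gain = overall_speedup(fraction=0.20, local_speedup=1.05)
print(f"{(gain - 1) * 100:.2f}% overall")
```

Under that assumption the whole program gains roughly one percent, and that is before counting any extra cache or page misses caused by the padding itself.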
JohnFound 20 Dec 2012, 05:49
The old versions of FASM were slower, especially on big sources. The big speedups came after several algorithmic changes Tomasz made. For example: some benchmarks
So, the conclusions:
1. If you need alignment in order to speed up your program, it is time to think about a better algorithm.
2. Make a program small enough and it will be fast enough.
nmake 20 Dec 2012, 06:03
The misunderstanding is that many THINK it is about speed, then they discover it didn't give them much speed, and then they ask why it doesn't give them all that much in speed gains.
It is completely misunderstood: alignment is not about speed. Although it will give you speed improvements, it is not about speed. The whole thing about alignment is misunderstood, and that is why the speed questions come up to begin with, because people don't know what alignment is for. If you knew what alignment was for, you wouldn't ask why it doesn't give you speed gains. Ask yourself, why are alignment and speed gains being discussed in the same context? It is sort of like buying new tires for your bicycle: you buy new tires to get better grip, but they also happen to produce less friction, so you gain more speed on your bicycle. Then people might ask, why don't these tires give me more speed gains? Because the tires are designed to give better grip, not better speed. The tiny extra speed gain is just a side effect. The same idiot might go back to the bicycle store and start to complain that these tires don't give better speed, and the seller hits him in the head with a large phone book, telling him these tires are not about speed gains. The idiot then responds with large teary eyes, and he walks out of the store in shame.
1. Alignment is about allowing for SIMD operations.
2. Alignment is about instruction prefetching (in this particular case).
3. Alignment is about cache pollution.
4. AND, it is about adding that tiny extra speed, but don't confuse this point: this point is necessary to validate the first 3 points, and the first 3 points are about things other than speed gains. It is easy to become confused here.
Last edited by nmake on 20 Dec 2012, 06:24; edited 1 time in total
revolution 20 Dec 2012, 06:20
So if it is not to gain any speed then will you be kind enough to tell us what the alignment is for?
Copyright © 1999-2024, Tomasz Grysztar.