flat assembler
Message board for the users of flat assembler.

Index > Windows > A more elegant way of aligning procedures

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 17 Dec 2012, 11:23
I am looking for a more elegant way of aligning procedures with critical loops. Normally, if you align your innermost loop, it adds nop instructions before the loop and after entry code. But the best way to align a function is to move the entire function forward so that the critical loop end up exactly on a 16 byte boundary, effectively eliminating all nop instructions between entry and loop.

I do this manually at the moment, I've tried different way of using virtual and padding automatically, but the assembler will often not be able to solve it and refuse to compile. I am not sure if you can use load and store for this.

Here is the way I do it:

Code:
db 7 dup $90
proc myproc
  some entry code
  some entry code
  some entry code

 if $ mod 16 = 0
    display 'loop properly aligned.'
 end if

 .critical_loop:
  critical code
  critical code
  critical code
  jmp critical_loop

  ret
endp
    


This will manually tell me if the loop is aligned or not, then I manually add more bytes right before the procedure to move the entire procedure forward, until the loop is aligned.

Can it be done more effectively using macros?
Post 17 Dec 2012, 11:23
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 17 Dec 2012, 11:28
Maybe:
Code:
times ((myproc - myproc.critical_loop) and 0xf) nop
proc myproc
;...    
Post 17 Dec 2012, 11:28
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8356
Location: Kraków, Poland
Tomasz Grysztar 17 Dec 2012, 11:31
There was a macro for this purpose posted here: http://board.flatassembler.net/topic.php?p=141756#141756
Post 17 Dec 2012, 11:31
View user's profile Send private message Visit poster's website Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 17 Dec 2012, 15:26
I got the same non-solvable problem using the times directive Sad I will look into that macro. If I put that macro in my code, will it override the original align macro?
Post 17 Dec 2012, 15:26
View user's profile Send private message Reply with quote
AsmGuru62



Joined: 28 Jan 2004
Posts: 1657
Location: Toronto, Canada
AsmGuru62 17 Dec 2012, 15:57
I always wondered: how come FASM is so blazingly fast,
yet its code never follows these alignment guidelines?
The loops are not aligned there or am I missing something?
Post 17 Dec 2012, 15:57
View user's profile Send private message Send e-mail Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 17 Dec 2012, 16:01
AsmGuru62: Yeah. Actually loop alignment is usually only going to give a very minor effect on execution speed. And things like this make me wonder if the the more important algorithmic and cache hit improvements have been properly applied before going into the nitty-gritty minor things things like this alignment problem.
Post 17 Dec 2012, 16:01
View user's profile Send private message Visit poster's website Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 17 Dec 2012, 16:10
alignment is that little secret thing that doesn't seem to do any good to any program, barely noticeable, but if you overlook it, it can do damage behind your back. It is a habit that needs to be dealt with, without having expectations that it will do miracles for you.

More important is cache behavior, reducing loop overhead, loop unrolling, spreading microoperations evenly across execution ports if possible at the retirement rate of your processor. Using the right execution units for the right job at the right time.

Many c++ coders will tell you algorithms is the most important thing of all, but if you implement an algorithm with bad code it can perform very bad. I could give you an example, GNU sort uses a sophisticated algorithm of (thousands) of lines of code. I skipped that algorithm, produced my own sort algorithm using brute force and it performed over a thousand percent better than gnu sort, even when I didn't use a good algorithm, merely brute force, straight forward.
Post 17 Dec 2012, 16:10
View user's profile Send private message Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 17 Dec 2012, 16:23
AsmGuru62 wrote:
I always wondered: how come FASM is so blazingly fast,
yet its code never follows these alignment guidelines?
The loops are not aligned there or am I missing something?


In this perticular case alignment is all about instruction prefetching, to me at least. But it is important when coding simd and also important in cache behavior. A misaligned piece of code that is above 64 bytes can have a huge penalthy in a critical loop. You also waste cache lines if you use 3 sets of 64 bytes that could have fit in 2 sets of 64 byte lines, effectively increasing cache pollution by one third.
Post 17 Dec 2012, 16:23
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 17 Dec 2012, 16:25
nmake wrote:
I got the same non-solvable problem using the times directive
I left out an important step:
Code:
align 16
times ((myproc - myproc.critical_loop) and 0xf) nop
proc myproc
;...    
Post 17 Dec 2012, 16:25
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 17 Dec 2012, 16:56
I have noticed that not only aligning a branch target on a 16 byte boundary, but ensuring that the next 16 byte boundary is also an instruction boundary also seems to help performance on my old Core 2 Duo. Maybe it's just a superstition and based on limited experience, but it seems to make a small but perceptible difference.
Aligning with nops doesn't seem to be a good idea in this case because those extra instructions will then have to be decoded in the inner loop. There are some instructions such as movdqa that have equivalent forms with different lengths (although the form that moves the instruction to a different port is unique) and also in pipelined code it may be possible to permute the order of some instructions.
I can't see how an assembler can carry out these kinds of optimizations unassisted because the programmer may have wanted to use the specific forms of instructions coded to watermark his code. Of course FASM doesn't have full support for watermarking anyhow so you probably would be using another assembler if you wanted to do that.
To check alignment I use the even more crude method of examining the assembler output with dumpbin.
Post 17 Dec 2012, 16:56
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 17 Dec 2012, 17:17
Have any of you take a moment to see Tomasz's post? This seems to work just fine:
Code:
include 'win32axp.inc'

macro align value,addr=$
{ 
  local base,size 
  if addr>$
    base = addr-size
    size = ((base+value-1)/value*value-base)
    db size dup 90h
  else
    db ((addr+value-1)/value*value-addr) dup 90h
  end if
}

proc start
local buff[256]:BYTE

  stdcall fib, 6
  cinvoke wsprintf, addr buff, <"Result: %u", 13, 10>, eax
  invoke  MessageBox, 0, addr buff, "Align test", 0
  invoke  ExitProcess, 0
endp

if used fib
  align 16, fib.loop
end if

proc fib, n

      mov     ecx, [n]
      mov     eax, 1
      xor     edx, edx
      test    ecx, ecx
      jz      .exit

.loop:
      xadd    eax, edx
      dec     ecx
      jnz     .loop

.exit:
      ret
endp

.end start    
OllyDbg:
Code:
CPU Disasm
Address   Hex dump          Command                                  Comments
00401000  /.  55            PUSH EBP
00401001  |.  89E5          MOV EBP,ESP
00401003  |.  81EC 00010000 SUB ESP,100
00401009  |.  6A 06         PUSH 6                                   ; /Arg1 = 6
0040100B  |.  E8 4F000000   CALL 0040105F                            ; \fiuwiue.0040105F
00401010  |.  50            PUSH EAX                                 ; /<%u>
00401011  |.  E8 0D000000   CALL 00401023                            ; |Format = "", jump over immediate data
00401016  |.  52 65 73 75 6 ASCII "Result: %u
",0                   ; |ASCII "Result: %u
"
00401023  |>  8D95 00FFFFFF LEA EDX,[LOCAL.64]                       ; |
00401029  |.  52            PUSH EDX                                 ; |Buf => OFFSET LOCAL.64
0040102A  |.  FF15 88204000 CALL DWORD PTR DS:[<&USER32.wsprintfA>]  ; \USER32.wsprintfA
00401030  |.  83C4 0C       ADD ESP,0C
00401033  |.  6A 00         PUSH 0                                   ; /Type = MB_OK|MB_DEFBUTTON1|MB_APPLMODAL
00401035  |.  E8 0B000000   CALL 00401045                            ; |Caption => "Align test", jump over immediate data
0040103A  |.  41 6C 69 67 6 ASCII "Align test",0                     ; |ASCII "Align test"
00401045  |>  8D95 00FFFFFF LEA EDX,[LOCAL.64]                       ; |
0040104B  |.  52            PUSH EDX                                 ; |Text => OFFSET LOCAL.64
0040104C  |.  6A 00         PUSH 0                                   ; |hOwner = NULL
0040104E  |.  FF15 84204000 CALL DWORD PTR DS:[<&USER32.MessageBoxA> ; \USER32.MessageBoxA
00401054  |.  6A 00         PUSH 0                                   ; /ExitCode = 0
00401056  |.  FF15 60204000 CALL DWORD PTR DS:[<&KERNEL32.ExitProces ; \KERNEL32.ExitProcess
0040105C  |.  90            NOP                                      ; Nops added by Tomasz's align macro
0040105D  |.  90            NOP
0040105E  |.  90            NOP
0040105F  |$  55            PUSH EBP                                 ; fiuwiue.0040105F(guessed Arg1)
00401060  |.  89E5          MOV EBP,ESP
00401062  |.  8B4D 08       MOV ECX,DWORD PTR SS:[EBP+8]
00401065  |.  B8 01000000   MOV EAX,1
0040106A  |.  31D2          XOR EDX,EDX
0040106C  |.  85C9          TEST ECX,ECX
0040106E  |.  74 06         JE SHORT 00401076
00401070  |>  0FC1D0        /XADD EAX,EDX                            ; Loop is aligned
00401073  |.  49            |DEC ECX
00401074  |.^ 75 FA         \JNE SHORT 00401070
00401076  |>  C9            LEAVE
00401077  \.  C2 0400       RETN 4
    
Post 17 Dec 2012, 17:17
View user's profile Send private message Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 17 Dec 2012, 17:24
I did look at it and will try it very soon Smile

What version of Ollydbg do you use btw?
Post 17 Dec 2012, 17:24
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 17 Dec 2012, 17:45
OllyDbg v2.01 (alpha 4), which I believe is seriously outdated now.
Post 17 Dec 2012, 17:45
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 18 Dec 2012, 09:09
nmake wrote:
Many c++ coders will tell you algorithms is the most important thing of all, but if you implement an algorithm with bad code it can perform very bad.
Using the proper algorithm is very important, of course. But we must be aware of the expected ranges our code is working within. Simply choosing the "best" algorithm based upon some vague O() notation is not always the right choice.
nmake wrote:
I could give you an example, GNU sort uses a sophisticated algorithm of (thousands) of lines of code. I skipped that algorithm, produced my own sort algorithm using brute force and it performed over a thousand percent better than gnu sort, even when I didn't use a good algorithm, merely brute force, straight forward.
This is normal. A library function cannot expect to be best in all situations. One can only hope that at best it performs well for the average case (or perhaps for the worst case?). But many programs do not operate within the average case and when trying to use a general library function you can expect to get average results. Writing specific code for each situation will always give better results than the general case. But specific code also is not suitable for the general case and may perform terribly in the worst case. As long as the inputs are known and under good control then the worst case might never arise.

In summary: Choosing the right algorithm for the task is what is important. And the right algorithm may not always be the one with the lowest O() value.

One should not expect to get fantastic results when using a generic library function in critical code.
Post 18 Dec 2012, 09:09
View user's profile Send private message Visit poster's website Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 18 Dec 2012, 10:07
Library or no library, there is not much wrong you can do in a sorting algorithm, load from disk, split files apart, sort parts, merge parts and save to disk again. Sorting bits of data in memory, there is not much wrong you can do here, unless you do something VERY wrong, which in this case gnu sort have done something very wrong. No matter how bad you code something, it should still perform pretty descent when dealing with data in memory, but it doesn't with gnu sort. They messed up, they thought a good algorithm would solve it, instead they messed it up by using bad code.

Complex algorithms in the hands of people who believe in the word "algorithm" may or may not work, it depends on luck. But simplicity of mind with a little bit of hardware understanding, you can do wonders without the same algorithm, and you can do even better with a good algorithm. An algorithm is only good if an algorithm is technically and equally adaptive for the hardware. There are people who believe all algorithms are fit for hardware, but it matters how you code it.

Before using an algorithm you have to weight it. Does it go back and forward in a zig-zag order (possibly creating bandwidth problems or cache problems), does the algorithm leave room to do other tasks in between different section of the algorithm, does this effect outweight the effect of using the algorithm, does the algorithm rely in a specific technological feature of the computer or not. There are many things to consider and sometimes it just simply pays off to drop everything in the junkyard and settle for pure and simplicity.
Post 18 Dec 2012, 10:07
View user's profile Send private message Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 18 Dec 2012, 10:22
LocoDelAssembly wrote:
OllyDbg v2.01 (alpha 4), which I believe is seriously outdated now.


I downloaded immunity debugger today, ironically it needs python runtimes (I just made a thread on this forum about python)

It is amazingly like ollydbg, it even uses same bitmaps on buttons in addition it has a graphical view just like ida pro have.
Post 18 Dec 2012, 10:22
View user's profile Send private message Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1903
DOS386 20 Dec 2012, 02:00
AsmGuru62 wrote:
I always wondered: how come FASM is so blazingly fast,
yet its code never follows these alignment guidelines?
The loops are not aligned there or am I missing something?


Most likely people overestimate the benefits of alignment. If you get 5% speedup while measuring an isolated loop, you will barely get 5% speedup of your complete program. Excessive aligning of everything increases bloat, thus it also increases cache and page misses, as the CPU walks through all those functions distributed around your app Wink
Post 20 Dec 2012, 02:00
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 20 Dec 2012, 05:49
The old versions of FASM was slower, especially on big sources. The big speedups became after several algorithmic changes Tomasz made. For example: some benchmarks

So, the conclusions:
1. If you need alignments in order to speedup you program, it is time to think about better algorithm.
2. Make a program small enough and it will be fast enough.
Post 20 Dec 2012, 05:49
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
nmake



Joined: 13 Sep 2012
Posts: 192
nmake 20 Dec 2012, 06:03
The misunderstanding is that many THINK it is about speed and then they discover it didn't give them so much speed and then they ask why it doesn't give them all that much speed gains.

It is completely misunderstood, alignment is not about speed, although it will give you speed improvements, it is not about speed. The whole thing about alignment is misunderstood, that is why the speed questions comes up to begin with, because people don't know what alignment is for. If you knew what alignment was for you wouldn't ask why it doesn't give you speed gains.

Ask yourself, why is alignment and speed-gains being discussed in the same context? Smile

It is sort of like buying new tires for your bicycle, you buy new tires to get better grip, but it also happens to produce less friction so you gain more speed on your bicycle, then people might ask, why doesnt these tires give me more speed gains? Because the tires are designed to give better grip, not better speed. The tiny extra speed gains is just a side effect.

The same idiot might go back to the bicycle store and start to complain that these tires doesn't give better speed, and the seller hit him in the head with a large phone book, telling him, these tires is not about speed gains. The idiot then responds by producing large wetty eyes, and he walks out of the store in shame. Very Happy

1. Alignment is about allowing for simd operations.
2. Alignment is about instruction prefetching (in this particular case)
3. Alignment is about cache pollution.
4. AND, it is about adding that tiny extra speed, but don't confuse this point, this point is necessary to have the 3 first points validate, and the first 3 points is about other things than speed gains.

It is easy to become confused here.


Last edited by nmake on 20 Dec 2012, 06:24; edited 1 time in total
Post 20 Dec 2012, 06:03
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20410
Location: In your JS exploiting you and your system
revolution 20 Dec 2012, 06:20
So if it is not to gain any speed then will you be kind enough to tell us what the alignment is for?
Post 20 Dec 2012, 06:20
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.