flat assembler
Message board for the users of flat assembler.
AMD Optimization 32t/64 bit
revolution 28 Nov 2005, 03:48
Quote: So why did AMD make NOP sequences with combos of 66h and 90h when the straight 90h padding runs just as fast?
Madis731 28 Nov 2005, 10:01
It's interesting code you've got there. There's definitely an algo in there, but I couldn't find it. I can see the pattern, though, and here's some code to help notice it:
Code:
macro padnum [params]
{
  forward
    if params=2
      db "f",90h      ; "f" is ASCII 66h, the operand-size prefix
    end if
    if params=3
      db "ff",90h
    end if
    if params=4
      db "fff",90h
    end if
}

macro AMDPad16
{
  virtual
    align 16
    a = $-$$          ; number of bytes needed to reach the next 16-byte boundary
  end virtual
  if a=1
    db 90h
  end if
  if a=2
    padnum 2
  end if
  if a=3
    padnum 3
  end if
  if a=4
    padnum 4
  end if
  if a=5
    padnum 3,2
  end if
  if a=6
    padnum 3,3
  end if
  if a=7
    padnum 4,3
  end if
  if a=8
    padnum 4,4
  end if
  if a=9
    padnum 3,3,3
  end if
  if a=10
    padnum 4,3,3
  end if
  if a=11
    padnum 4,4,3
  end if
  if a=12
    padnum 4,4,4
  end if
  if a=13
    padnum 4,3,3,3
  end if
  if a=14
    padnum 4,4,3,3
  end if
  if a=15
    padnum 4,4,4,3
  end if
}
I think you can tell FASM with macros that you want the padding done with the fewest and biggest groups. Hmm...interesting ^o)
Madis731 28 Nov 2005, 11:03
I'm optimizing it right now - I think you might want to see the middle stages:
Code:
macro padnum [params]
{
  forward
    if params=1
      db 90h
    end if
    if params=2
      db "f",90h
    end if
    if params=3
      db "ff",90h
    end if
    if params=4
      db "fff",90h
    end if
}

macro AMDPad16
{
  virtual
    align 16
    a = $-$$
  end virtual
  if a<5
    padnum a ;/1
  else if a<9
    padnum (a+1)/2,a/2
  else if a<13
    padnum (a+2)/3,(a+1)/3,a/3
  else if a<16
    padnum (a+3)/4,(a+2)/4,(a+1)/4,a/4
  end if
}
Madis731 28 Nov 2005, 11:58
I told you it contains some algo
Code:
macro padnum [params]
{
  times params-1 db "f" ;66h
  db 90h
}

macro AMDPad16
{
  virtual
    align 16
    a = $-$$
  end virtual
                 ;  a+3
  c=(a+3) shr 2  ;  ---
                 ;   4
  repeat c
    padnum (a-%+c)/c
  end repeat
}
EDIT: some comments and clearer code
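The closed form in the last macro can be checked against the hand-written table from the first version by modelling it outside the assembler. The sketch below is plain Python, not fasm, and is not part of the thread; `group_sizes`, `encode` and `HAND_TABLE` are hypothetical names chosen for illustration. It assumes that `shr 2` and `/` in the macro behave as integer shift and integer division, and that `%` counts from 1 to `c`.

```python
# Model of the AMDPad16 grouping formula; plain Python, not fasm.
# All names here are illustrative and do not appear in the thread.

def group_sizes(a):
    """Group sizes for a pad bytes, mirroring
    c = (a+3) shr 2 and padnum (a-%+c)/c for % = 1..c."""
    c = (a + 3) >> 2
    return [(a - k + c) // c for k in range(1, c + 1)]

# The table written out case by case in the first version of the macro.
HAND_TABLE = {
    1: [1], 2: [2], 3: [3], 4: [4],
    5: [3, 2], 6: [3, 3], 7: [4, 3], 8: [4, 4],
    9: [3, 3, 3], 10: [4, 3, 3], 11: [4, 4, 3], 12: [4, 4, 4],
    13: [4, 3, 3, 3], 14: [4, 4, 3, 3], 15: [4, 4, 4, 3],
}

def encode(sizes):
    """Each group of n bytes is (n-1) 66h prefixes followed by one 90h NOP."""
    out = bytearray()
    for n in sizes:
        out += b"\x66" * (n - 1) + b"\x90"
    return bytes(out)

if __name__ == "__main__":
    for a in range(16):
        sizes = group_sizes(a)
        assert sum(sizes) == a                    # padding fills exactly a bytes
        assert a == 0 or sizes == HAND_TABLE[a]   # same grouping as the table
        print(a, sizes, encode(sizes).hex(" "))
```

For every pad length from 1 to 15 the formula yields the same groups as the table, and for a pad length of 0 it emits nothing.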
LocoDelAssembly 28 Nov 2005, 14:25
I have a question about this: why don't you use sequences of instructions that do nothing?
Example: instead of
Code:
if a=4
  db 66h,66h,66h,90h
end if
why not
Code:
if a=4
  mov eax, eax
  mov edx, edx
end if
?
Disassembling some executables written in high-level languages, I found that they never use NOP sequences except where no other instruction fits in the remaining space. Usually they use LEAs in the sequence, but I can't find a way to force FASM to assemble "lea eax, [eax+0]"; FASM assembles "lea eax, [eax]", which takes much less space.
Regards, LocoDelAssembly
Tomasz Grysztar 28 Nov 2005, 14:41
You can use "lea eax,[dword eax+0]" to force using the 32-bit displacement; but for the 8-bit one you have to define it as byte opcodes, as fasm always optimizes them.
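To make the size difference concrete, here is a small Python sketch (not fasm, and not from the thread) that builds the three 32-bit-mode encodings of "lea eax,[eax+0]" from the opcode and ModRM byte. `lea_eax_eax` is a hypothetical helper that handles only the eax/eax case, where no SIB byte is involved.

```python
# Sketch: the three encodings of "lea eax,[eax+0]" in 32-bit mode.
# lea_eax_eax is an illustrative helper; it only handles reg=eax, base=eax.

LEA_OPCODE = 0x8D
EAX = 0  # register number of eax in the ModRM reg and r/m fields

def lea_eax_eax(disp_bytes):
    """Encode lea eax,[eax+0] with a 0-, 1- or 4-byte zero displacement."""
    mod = {0: 0b00, 1: 0b01, 4: 0b10}[disp_bytes]
    modrm = (mod << 6) | (EAX << 3) | EAX
    return bytes([LEA_OPCODE, modrm]) + bytes(disp_bytes)

if __name__ == "__main__":
    for n in (0, 1, 4):
        code = lea_eax_eax(n)
        print(f"{n}-byte displacement, {len(code)} bytes total: {code.hex(' ')}")
```

The forms come out as 2, 3 and 6 bytes. Following the reply above, the 3-byte form would have to be written as raw bytes in fasm (for example `db 8Dh,40h,00h`), while the 6-byte form is what "lea eax,[dword eax+0]" produces.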
LocoDelAssembly 28 Nov 2005, 14:49
Oops, I didn't know that.
Well, now my original idea: instead of
Code:
if a=6
  db 66h,66h,90h,66h,66h,90h
end if
why not
Code:
if a=6
  lea eax, [dword eax + 0]
end if
Thanks, Tomasz, for both replies.
Madis731 28 Nov 2005, 15:05
You can always just jump to the next aligned location, if that is the point. Then you can have strings or whatever in between:
Code:
jmp alignment_label
;When you are off by 10 here
db "filling 10"
alignment_label:
LocoDelAssembly 28 Nov 2005, 15:12
Yes, I see that in disassemblies too: if padding with instructions would spend more cycles than just jumping, then a JMP is used; otherwise a sequence of instructions that doesn't modify anything is used.
revolution 28 Nov 2005, 15:47
Using "lea" or "mov" etc. is not the optimal way to get the least activity inside the CPU. AMD specifically states that 66h,66h,90h is the optimal way to "do nothing"; they made the processor, so I guess they should know best.
LocoDelAssembly 28 Nov 2005, 16:37
Code:
include 'win32ax.inc'

.code

start:
  invoke GetCurrentProcess
  invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
  invoke GetTickCount
  push eax
  xor ecx, ecx
  align 16
.loop:
  lea eax, [dword eax + 0]
  db 66h,66h,90h,66h,66h,90h
  dec ecx
  jnz .loop
  invoke GetTickCount
  pop edx ; Result of previous call to GetTickCount
  sub eax, edx
  cinvoke wsprintf, output, fmt, eax
  invoke MessageBox, 0, output, title, 0
  invoke ExitProcess,0

.data
  fmt db "Loop took %u ms", 0
  title db "AMD speed test", 0
  output rb 256

.end start
Well, I can't believe what I'm seeing: if I comment out the suggested AMD sequence the loop takes 4360 ms, if I comment out the lea instead it again takes 4360 ms, and if I comment out both the loop still takes 4360 ms. WHAT'S HAPPENING HERE? I'm using an Athlon 3200+ (S939) with WinXP 32-bit.
[edit]Now I tried running at 1005 MHz instead of 2010 MHz and the same thing still happens, but now it takes 8610 ms. Well, I will try another way to take the time measurement.[/edit]
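The reported times are easier to compare when converted to clock cycles per loop iteration. The Python sketch below (not from the thread) does that arithmetic. It assumes the clock speeds quoted in the post are exact and that the loop runs 2^32 times, which follows from starting with ecx = 0 and looping on dec ecx / jnz. `cycles_per_iteration` is a hypothetical helper name.

```python
# Back-of-the-envelope conversion of the reported timings to cycles per iteration.
# Assumption: the clock speeds quoted in the post (2010 MHz, 1005 MHz) are exact.

ITERATIONS = 2 ** 32  # ecx starts at 0; dec ecx / jnz runs through all 32-bit values

def cycles_per_iteration(elapsed_ms, clock_mhz):
    """Total elapsed clock cycles divided by the number of loop iterations."""
    total_cycles = elapsed_ms / 1000.0 * clock_mhz * 1_000_000
    return total_cycles / ITERATIONS

if __name__ == "__main__":
    for ms, mhz in ((4360, 2010), (8610, 1005)):
        cpi = cycles_per_iteration(ms, mhz)
        print(f"{ms} ms at {mhz} MHz: {cpi:.2f} cycles per iteration")
```

Both measurements come out at roughly 2 cycles per iteration, so the time scales with the clock and not with the loop body. That is consistent with the observation that removing the padding bytes or the lea made no difference, though the arithmetic alone does not show what limits the loop.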
Madis731 28 Nov 2005, 17:35
Because modern CPUs don't count ticks but micro-operations. Intel can issue 3 of those in one clock (I don't know about AMD).
When you write a jump, any half-tick will be finished and a new tick started. With this kind of loop it takes 1 clock no matter what, so commenting things out doesn't help - you are NOT getting 100% out of your CPU, though. I wonder why my PIII laptop (700 MHz) takes 43232 ms and 12523 ms respectively.
Code:
.loop:
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  db 66h,66h,90h,66h,66h,90h
  db 66h,66h,90h,66h,66h,90h
  db 66h,66h,90h,66h,66h,90h
  db 66h,66h,90h,66h,66h,90h
  db 66h,66h,90h,66h,66h,90h
  dec ecx
  jnz .loop
With this kind of code you minimize the impact of the jumps and you can notice some changes.
LocoDelAssembly 28 Nov 2005, 18:23
Well, I found better times with this:
Code:
.loop:
  lea eax, [dword eax + 0]
  lea ebx, [dword ebx + 0]
  lea ecx, [dword ecx + 0]
  lea edx, [dword edx + 0]
  lea eax, [dword eax + 0]
  dec ecx
  jnz .loop
It works faster than:
Code:
.loop:
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  lea eax, [dword eax + 0]
  dec ecx
  jnz .loop
Quote: I wonder why my PIII laptop (700MHz) takes 43232ms and 12523ms respectively
[edit]Executing only the byte sequences in the loop takes 10734 ms.[/edit]
Madis731 29 Nov 2005, 09:01
locodelassembly wrote: Which code takes 12523 ms?
I tried with everything enabled (byte sequences & LEAs) and with nothing (only the empty loop).
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.