flat assembler
Message board for the users of flat assembler.

High performance NOP for x87 !?

Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Hi there,

I'm currently optimizing the x87 FPU version of my fractal code. I found something interesting in the AMD optimization guide "Software Optimization Guide for AMD Family 10h Processors":

"Typically three DirectPath instructions occupy 7 bytes. Maintaining 8-byte alignment for the next group of three instructions requires the addition of a single byte. A 1-byte padding can easily be achieved using the single-byte NOP instruction (opcode 90h), as recommended in “Code Padding with Operand-Size Override and Multibyte NOP” on page 68. However, for the special case of x87 instructions, the operand-size override (66h) serves as a high-performance NOP instruction and is the recommended choice for padding an x87 instruction without altering its behavior, as shown here:

DB 066h ; Operand-size override used as high-performance NOP instruction

This usage of the operand-size override alone as a filler byte (without an accompanying NOP instruction) is permitted only for x87 instructions. This usage of the operand-size override can be applied to all but four of the x87 instructions. The FLDENV, FRSTOR, FSTENV, and FSAVE instructions and their no-wait forms behave differently when associated with an operand-size override; therefore, these should not be padded with the operand-size override."

Has anybody here used this and gained a big performance boost in their code? I tried it, but the effect was, let's say, less than 1%... and maybe it will affect Intel CPUs badly?
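As a rough illustration of the rule quoted above, the byte arithmetic can be sketched in Python. The helper names (`padding_needed`, `pad_x87`) are hypothetical and do not come from the AMD guide; the sketch only puts a single 66h in front of an already-encoded x87 instruction and refuses the instructions the guide excludes.

```python
# Sketch of the quoted padding rule; helper names are made up for illustration.
OPERAND_SIZE_PREFIX = 0x66

# Instructions the quoted guide says must not be padded with 66h
# (FLDENV, FRSTOR, FSTENV, FSAVE and the no-wait forms FNSTENV, FNSAVE).
EXCLUDED = frozenset({"fldenv", "frstor", "fstenv", "fnstenv", "fsave", "fnsave"})


def padding_needed(group_len, alignment=8):
    """Bytes to add so a group of group_len bytes ends on an alignment boundary."""
    return (-group_len) % alignment


def pad_x87(mnemonic, encoding):
    """Return the encoding with one 66h prefix in front of it."""
    if mnemonic.lower() in EXCLUDED:
        raise ValueError(mnemonic + " behaves differently with a 66h prefix")
    return bytes([OPERAND_SIZE_PREFIX]) + bytes(encoding)


if __name__ == "__main__":
    fmul = bytes([0xD8, 0xC9])          # fmul st0, st1
    print(padding_needed(7))            # filler bytes for a 7-byte group
    print(pad_x87("fmul", fmul).hex())  # prefix goes in front of the opcode
```

Whether the padded form is actually faster still has to be measured on the target CPU; the sketch only covers the encoding side.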
Post 09 Apr 2008, 22:59
Remy Vincent



Joined: 16 Sep 2005
Posts: 155
Location: France
This kind of optimization could very easily be done with just an ADA directive... Why don't you wait for the "high diploma people" to organize new ADA meetings and decide on new compiler directives...
Post 10 Apr 2008, 00:03
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
I'm assuming the prefix should precede the instruction...

66 00 01 02 03 04 05 06

...and not...

00 01 02 03 04 05 06 66

My AMD optimizations haven't ventured into the FPU arena - I wasn't even aware of this use of 66. At the time AMD's FPU implementation was much better than Intel's - since they brought over the guys from Digital Equipment Corporation that worked on the Alpha.

_________________
¯\(°_o)/¯ unlicense.org


Last edited by bitRAKE on 10 Apr 2008, 14:25; edited 1 time in total
Post 10 Apr 2008, 14:21
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
Hmm, I've seen 0x66 mentioned as padding, I've seen it literally in the middle of 0x90 bytes for padding, but I have nothing to contribute to this discussion.
Post 10 Apr 2008, 14:23
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
I've always thought that optimising nops (I'm referring to opcode 0x90, not the prefix 0x66) was a bit pointless. Back in the (good? bad?) old 8086 days nops were used for delays, and of course there was no sense in optimising delays. Now fast forward a few years to modern CPUs, we use nops for alignment before entering a loop, so the loop should be optimised to death and the pre-nops are kind of like the .001% icing on the cake.
Post 10 Apr 2008, 14:37
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
AMD also recommended using all kinds of "NOP" to reduce false dependencies with adjacent code.
Code:
;MOV REG, REG
;XCHG REG, REG
;CMOVcc REG, REG
;SHR REG, 0
;SAR REG, 0
;SHL REG, 0
;SHRD REG, REG, 0
;SHLD REG, REG, 0
;LEA REG, [REG]
;LEA REG, [REG+00]
;LEA REG, [REG*1+00]
;LEA REG, [REG+00000000]
;LEA REG, [REG*1+00000000]

NOP2_EAX EQU <DB 8Bh,0C0h>     ;MOV EAX, EAX
NOP2_ECX EQU <DB 8Bh,0C9h>       ;MOV ECX, ECX
NOP2_EDX EQU <DB 8Bh,0D2h>       ;MOV EDX, EDX
NOP2_EBX EQU <DB 8Bh,0DBh>       ;MOV EBX, EBX
NOP2_ESP EQU <DB 8Bh,0E4h>       ;MOV ESP, ESP
NOP2_EBP EQU <DB 8Bh,0EDh>       ;MOV EBP, EBP
NOP2_ESI EQU <DB 8Bh,0F6h>       ;MOV ESI, ESI
NOP2_EDI EQU <DB 8Bh,0FFh>       ;MOV EDI, EDI

; No SIB byte, source in ModR/M byte:
;NOP2_EAX EQU <DB 8Dh,00h>      ; lea eax, [eax]
;NOP2_ECX EQU <DB 8Dh,09h>    ; lea ecx, [ecx]
;NOP2_EDX EQU <DB 8Dh,12h>    ; lea edx, [edx]
;NOP2_EBX EQU <DB 8Dh,1Bh>    ; lea ebx, [ebx]
;NOP2_ESI EQU <DB 8Dh,36h>    ; lea esi, [esi]
;NOP2_EDI EQU <DB 8Dh,3Fh>    ; lea edi, [edi]

; SIB byte to select source:
NOP3_EAX EQU <DB 8Dh,04h,20h> ;LEA EAX, [EAX]
NOP3_ECX EQU <DB 8Dh,0Ch,21h>  ;LEA ECX, [ECX]
NOP3_EDX EQU <DB 8Dh,14h,22h>  ;LEA EDX, [EDX]
NOP3_EBX EQU <DB 8Dh,1Ch,23h>  ;LEA EBX, [EBX]
NOP3_ESP EQU <DB 8Dh,24h,24h>  ;LEA ESP, [ESP]
NOP3_ESI EQU <DB 8Dh,34h,26h>  ;LEA ESI, [ESI]
NOP3_EDI EQU <DB 8Dh,3Ch,27h>  ;LEA EDI, [EDI]

; No SIB byte, but add signed byte:
;NOP3_EAX EQU <DB 8Dh,40h,00h>  ; lea eax, [eax+00]
;NOP3_ECX EQU <DB 8Dh,49h,00h>     ; lea ecx, [ecx+00]
;NOP3_EDX EQU <DB 8Dh,52h,00h>     ; lea edx, [edx+00]
;NOP3_EBX EQU <DB 8Dh,5Bh,00h>     ; lea ebx, [ebx+00]
NOP3_EBP EQU <DB 8Dh,6Dh,00h>      ; lea ebp, [ebp+00]
;NOP3_ESI EQU <DB 8Dh,76h,00h>     ; lea esi, [esi+00]
;NOP3_EDI EQU <DB 8Dh,7Fh,00h>     ; lea edi, [edi+00]

; SIB byte, and add signed byte:
NOP4_EAX EQU <DB 8Dh,44h,20h,0>        ;lea eax, [00][eax]
NOP4_ECX EQU <DB 8Dh,4Ch,21h,0>    ;lea ecx, [00][ecx]
NOP4_EDX EQU <DB 8Dh,54h,22h,0>    ;lea edx, [00][edx]
NOP4_EBX EQU <DB 8Dh,5Ch,23h,0>    ;lea ebx, [00][ebx]
NOP4_ESP EQU <DB 8Dh,64h,24h,0>    ;lea esp, [00][esp]
NOP4_EBP EQU <DB 8Dh,6Ch,25h,0>    ;lea ebp, [00][ebp]
NOP4_ESI EQU <DB 8Dh,74h,26h,0>    ;lea esi, [00][esi]
NOP4_EDI EQU <DB 8Dh,7Ch,27h,0>    ;lea edi, [00][edi]

;NOP5_EAX EQU <TEST EAX, 0FFFF0000h>
;NOP5_EAX EQU <CMP EAX, 0FFFF0000h>

; No SIB byte, but add signed dword:
NOP6_EAX EQU <DB 8Dh,080h,0,0,0,0>  ;lea eax, [eax+00000000]
NOP6_EBX EQU <DB 8Dh,09Bh,0,0,0,0>    ;lea ebx, [ebx+00000000]
NOP6_ECX EQU <DB 8Dh,089h,0,0,0,0>    ;lea ecx, [ecx+00000000]
NOP6_EDX EQU <DB 8Dh,092h,0,0,0,0>    ;lea edx, [edx+00000000]
NOP6_ESI EQU <DB 8Dh,0B6h,0,0,0,0>    ;lea esi, [esi+00000000]
NOP6_EDI EQU <DB 8Dh,0BFh,0,0,0,0>    ;lea edi, [edi+00000000]
NOP6_EBP EQU <DB 8Dh,0ADh,0,0,0,0>    ;lea ebp, [ebp+00000000]

; SIB byte, and add signed dword
NOP7_EAX EQU <DB 8Dh,084h,20h,0,0,0,0>        ;lea eax, [00000000][eax]
NOP7_ECX EQU <DB 8Dh,08Ch,21h,0,0,0,0>       ;lea ecx, [00000000][ecx]
NOP7_EDX EQU <DB 8Dh,094h,22h,0,0,0,0>       ;lea edx, [00000000][edx]
NOP7_EBX EQU <DB 8Dh,09Ch,23h,0,0,0,0>       ;lea ebx, [00000000][ebx]
NOP7_ESP EQU <DB 8Dh,0A4h,24h,0,0,0,0>       ;lea esp, [00000000][esp]
NOP7_EBP EQU <DB 8Dh,0ACh,25h,0,0,0,0>       ;lea ebp, [00000000][ebp]
NOP7_ESI EQU <DB 8Dh,0B4h,26h,0,0,0,0>       ;lea esi, [00000000][esi]
NOP7_EDI EQU <DB 8Dh,0BCh,27h,0,0,0,0>       ;lea edi, [00000000][edi]

; SIB byte adding signed dword:
;NOP7_EAX EQU <DB 8Dh,04h,05h,0,0,0,0>    ;LEA EAX, [][EAX+00000000]
;NOP7_ECX EQU <DB 8Dh,0Ch,0Dh,0,0,0,0>      ;LEA ECX, [][ECX+00000000]
;NOP7_EDX EQU <DB 8Dh,14h,15h,0,0,0,0>      ;LEA EDX, [][EDX+00000000]
;NOP7_EBX EQU <DB 8Dh,1Ch,1Dh,0,0,0,0>      ;LEA EBX, [][EBX+00000000]
;NOP7_EBP EQU <DB 8Dh,2Ch,2Dh,0,0,0,0>      ;LEA EBP, [][EBP+00000000]
;NOP7_ESI EQU <DB 8Dh,34h,35h,0,0,0,0>      ;LEA ESI, [][ESI+00000000]
;NOP7_EDI EQU <DB 8Dh,3Ch,3Dh,0,0,0,0>      ;LEA EDI, [][EDI+00000000]    
:P (my replies are usually lacking in content, but this has got to be the largest post including absolutely nothing, lol!)
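The MOV and LEA rows in the notes above follow directly from the ModR/M and SIB bit layout, so they can be regenerated and cross-checked. A small Python sketch, with hypothetical helper names, that builds the 2-byte MOV form and the 3-byte SIB form:

```python
# Cross-check for the NOP notes above; function names are illustrative only.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]  # register numbers 0..7


def modrm(mod, reg, rm):
    """Pack the ModR/M byte: mod (2 bits), reg (3 bits), r/m (3 bits)."""
    return (mod << 6) | (reg << 3) | rm


def sib(scale, index, base):
    """Pack the SIB byte: scale (2 bits), index (3 bits), base (3 bits)."""
    return (scale << 6) | (index << 3) | base


def mov_reg_reg(name):
    """8B /r with mod=11: mov reg, reg (2 bytes)."""
    r = REGS.index(name)
    return bytes([0x8B, modrm(3, r, r)])


def lea_sib(name):
    """8D /r with mod=00, r/m=100 (SIB follows), index=100 (none): lea reg, [reg]."""
    if name == "ebp":
        # With mod=00, base=101 means "disp32, no base", so EBP cannot be named here.
        raise ValueError("ebp needs a displacement form")
    r = REGS.index(name)
    return bytes([0x8D, modrm(0, r, 4), sib(0, 4, r)])


if __name__ == "__main__":
    for name in REGS:
        if name != "ebp":
            print(name, mov_reg_reg(name).hex(), lea_sib(name).hex())
```

The EBP special case is also why the notes have no 3-byte SIB entry for EBP and use the `[ebp+00]` form instead.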

Post 10 Apr 2008, 14:54
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
bitRAKE wrote:
Code:
;NOP5_EAX EQU <TEST EAX, 0FFFF0000h>
;NOP5_EAX EQU <CMP EAX, 0FFFF0000h>    
I hope AMD have not suggested that as a nop. Did they forget about the flags?
Code:
nop5 equ db 03eh,08dh,044h,020h,000h  ;lea eax,ds:[eax+000h] - uses SIB    
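To make the flags point concrete, here is a toy Python model (not an emulator; the function names are made up, and PF and AF are left out). It assumes only what the instruction definitions state: TEST sets ZF and SF from the AND result and clears CF and OF, while LEA leaves the flags alone.

```python
# Toy model of the status flags; names are illustrative, PF/AF are not modelled.
MASK32 = 0xFFFFFFFF


def flags_after_test(flags, eax, imm):
    """Flags after 32-bit TEST eax, imm: ZF/SF from the AND result, CF and OF cleared."""
    result = eax & imm & MASK32
    new = dict(flags)
    new["ZF"] = int(result == 0)
    new["SF"] = result >> 31
    new["CF"] = 0
    new["OF"] = 0
    return new


def flags_after_lea(flags):
    """LEA only computes an address, so the flags come out unchanged."""
    return dict(flags)


if __name__ == "__main__":
    # Suppose an earlier ADD left a carry that a following ADC still needs.
    before = {"ZF": 0, "SF": 0, "CF": 1, "OF": 0}
    print(flags_after_test(before, 0x12345678, 0xFFFF0000))
    print(flags_after_lea(before))
```

So a TEST or CMP used as filler is only safe where nothing after it consumes the old flags.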
Post 10 Apr 2008, 15:05
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
NOPs, but ... they change flags.
Code:
xor eax,0
and eax,not 0
or eax,0
add eax,0
imul eax,1  ;to wait a long time
idiv eax,1   ; to wait a very long time (not encodable as written: idiv takes a single register/memory operand)
sub eax,0
jmp $+1
jmp $+2
    
Post 10 Apr 2008, 15:48
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
edfed wrote:
Code:
jmp $+1    
That one is dangerous!
Post 10 Apr 2008, 15:52
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
sorry, i wanted to say:

jmp $+3

i never use nops...
Post 10 Apr 2008, 16:16
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
edfed wrote:
Code:
jmp $+3    
That one is dangerous also!
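Why both `jmp $+1` and `jmp $+3` are risky can be seen from the encoding, assuming the assembler picks the 2-byte short form (EB rel8), which is not guaranteed. The rel8 value is counted from the end of the jump, so only `$+2` targets the next instruction. A Python sketch with a hypothetical helper name:

```python
# Sketch of the short-jump arithmetic; the helper name is made up for illustration.
SHORT_JMP_LEN = 2  # EB rel8


def encode_short_jmp(offset_from_dollar):
    """Encode `jmp short $+offset`; rel8 is relative to the end of the 2-byte jump."""
    rel8 = offset_from_dollar - SHORT_JMP_LEN
    if not -128 <= rel8 <= 127:
        raise ValueError("target out of rel8 range")
    return bytes([0xEB, rel8 & 0xFF])


if __name__ == "__main__":
    for k in (1, 2, 3):
        print("jmp $+%d ->" % k, encode_short_jmp(k).hex())
```

With `$+1` the CPU continues at the jump's own second byte and decodes whatever that byte and the following ones happen to form. With `$+3` it skips one byte after the jump, which lands in the middle of the next instruction unless that instruction is exactly one byte long.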
Post 10 Apr 2008, 16:21
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
:'(

I read somewhere that $+3 will generate a jmp word
and $+2 will generate a jmp byte.
Post 10 Apr 2008, 16:40
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
If you want "jmp word", write "jmp word"; don't expect the assembler to generate any particular form. FASM can choose any form for a given mnemonic; even a 32-bit offset would be a valid result for "jmp $+2".
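A small Python sketch of why the chosen form matters: `$` is the address of the jump itself, so where `$+k` lands depends on how long the jump turns out to be. The lengths below are the standard direct JMP encodings; the function names are hypothetical.

```python
# Illustrative only: where `$+k` points for each direct JMP encoding length.
JMP_FORMS = {
    "short (EB rel8)": 2,
    "near, 16-bit operand (E9 rel16)": 3,
    "near, 32-bit operand (E9 rel32)": 5,
}


def landing(offset_from_dollar, jmp_len):
    """Describe the target `$+offset` relative to a jump that is jmp_len bytes long."""
    if 0 < offset_from_dollar < jmp_len:
        return "inside the jump itself"
    if offset_from_dollar == jmp_len:
        return "next instruction"
    if offset_from_dollar > jmp_len:
        return "skips %d byte(s) after the jump" % (offset_from_dollar - jmp_len)
    return "at or before the jump"


def describe(offset_from_dollar):
    """Map every encoding form to the landing spot of `$+offset`."""
    return {form: landing(offset_from_dollar, n) for form, n in JMP_FORMS.items()}


if __name__ == "__main__":
    for k in (2, 3):
        for form, where in describe(k).items():
            print("jmp $+%d, %s: %s" % (k, form, where))
```

A label placed right after the jump avoids the guesswork, since the assembler resolves it correctly whichever form it picks.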
Post 10 Apr 2008, 17:37
bitRAKE



Joined: 21 Jul 2003
Posts: 2913
Location: [RSP+8*5]
revolution wrote:
bitRAKE wrote:
Code:
;NOP5_EAX EQU <TEST EAX, 0FFFF0000h>
;NOP5_EAX EQU <CMP EAX, 0FFFF0000h>    
I hope AMD have not suggested that as a nop. Did they forget about the flags?
This is just my NOP notes file - some AMD / some my own musings.

Post 10 Apr 2008, 21:23
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
Kuemmel wrote:

I'm currently optimizing my fractal code regarding the x87 fpu version. I found in the AMD optimization guide "Software Optimization Guide for AMD Family 10h Processors" something interesting:


Uh, even my "new" AMD64x2 laptop (bought in August) only has family "0x0F". So this optimization (unless it also works on older machines) would be very very fringe.
Post 14 Apr 2008, 21:57

Copyright © 1999-2020, Tomasz Grysztar.
