flat assembler
Message board for the users of flat assembler.
-Id-



Joined: 07 Apr 2005
Posts: 1

More-than-one-byte NOPs.

I've noticed that when you use the align n pseudoinstruction, the compiler always does the padding with NOP (opcode 90h) instructions.

A feature that could be implemented in the next version of FASM is the use of multi-byte "nop" instructions (whichever combination the compiler likes best).

Those instructions appear in the AMD Athlon optimization guide, page 59. Here is the link to the document:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
Post 07 Apr 2005, 14:31
THEWizardGenius



Joined: 14 Jan 2005
Posts: 382
Location: California, USA

Try these...

Here's a two-byte "NOP":

jmp $+2

It simply jumps over itself to the next instruction. There are other types of NOPs; check out this URL: http://www.df.lth.se/~john_e/gems/gem0008.html

It has 3-byte, 4-byte, 5-byte, and 6-byte NOPs. Of course, remember you can do things like this to get a longer NOP (I think this only works on 386+, but that's not likely to matter, since most people have at least a 386):

rol eax,32
rol ax,16
rol ah,8

Or a REALLY long "NOP" (which works on older CPUs):

push cl
mov cl,16
rol ax,cl
pop cl

Just try to think of different things which do absolutely nothing, or which have no effect on anything:

push ax
pop ax

This is very simple, and there are many possibilities. You can create really long NOPs if you want, or just put several short NOPs in a row to get a bigger one. Do whatever works.

_________________
FASM Rules!
OS Dev is fun!
Pepsi tastes nasty!
Some ants toot!
It's over!
Post 07 Apr 2005, 15:22
Reverend



Joined: 24 Aug 2004
Posts: 409
Location: Poland


Quote:
rol eax,32
rol ax,16
rol ah,8

push cl
mov cl,16
rol ax,cl
pop cl

push ax
pop ax

I don't think it's a good idea. At runtime, the processor loads instructions into a special cache and pre-executes them. NOP is a special opcode that does nothing, so the processor can skip it (or just spend less time analyzing it). Also, as an instruction that does nothing, NOP takes "nearly" no time to execute, unlike the examples you've shown above. Another nice thing is that when code is aligned, the padding is filled entirely with 90h bytes, so if the file is later packed with, for example, some PE packer, the repeated bytes compress better.
Post 07 Apr 2005, 16:29
MazeGen



Joined: 06 Oct 2003
Posts: 953
Location: Czechoslovakia

Yeah, the Intel Optimization Manual (chapter General Optimization Guidelines, Miscellaneous, NOPs) is also very clear on this topic.
Post 07 Apr 2005, 16:55
THEWizardGenius



Joined: 14 Jan 2005
Posts: 382
Location: California, USA

Yeah...

Good point. I'm not saying that is a good idea, I'm just saying that's one way to do it. NOP is still the best, but if you ever need something else (for example, if you have a project where for some reason you aren't allowed to use NOP) it can work, despite being slower. You definitely shouldn't use that kind of thing in a high-performance application or game; I am simply being creative.

Remember, NOP (90h) can be repeated, so if you really need a several-byte NOP, you can repeat 90h rather than use a more complex type of NOP. Let me remind you, Reverend, that NOP is occasionally used for delays when programming hardware directly, since the CPU is faster than the hardware, and even NOP (or JMP $+2) can sometimes take enough time for the delay. This is one situation where my examples may be better, but "jmp $+2" is still the best 2-byte NOP.

I have to admit, -Id-, I don't know EXACTLY what you meant by what you said, I was just responding to the general idea of multiple-byte NOPs.

BTW, I think FASM at least tries, as far as possible, to stick to Intel standards. So most of the time, something that doesn't exist on Pentium will not be included. So if you are speaking of a multi-byte NOP (what's the use of that, anyways?) that only AMD processors support, I don't think FASM will ever use it. However, I think FASM usually does support the more important AMD technologies.

_________________
FASM Rules!
OS Dev is fun!
Pepsi tastes nasty!
Some ants toot!
It's over!
Post 07 Apr 2005, 20:19
MazeGen



Joined: 06 Oct 2003
Posts: 953
Location: Czechoslovakia


Quote:

Remember, NOP (90h) can be repeated so if you really need a several-byte NOP, you can repeat 90h rather than a more complex type of NOP.


Well, there has to be some reason why Intel recommends a multi-byte NOP instead of repeated NOPs (90h). I guess it has something to do with the decoder: decoding one instruction is probably a bit faster than decoding 6 instructions (e.g. one LEA EAX,[EAX+00000000] instead of 6 NOPs).

Quote:

"jmp $+2" is still the best 2-byte NOP.


Such an instruction may reload the cache, so it is not a true NOP. Intel recommends MOV reg1, reg1 instead.
Post 07 Apr 2005, 22:00
MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany

Re: Try these...


THEWizardGenius wrote:

jmp $+2


This one is a good NOP example


THEWizardGenius wrote:

rol eax,32
rol ax,16
rol ah,8


Actually, these ain't NOPs, 'cause they modify the flags! This should be taken into account.


THEWizardGenius wrote:

push cl
mov cl,16
rol ax,cl
pop cl


The same is true for this example. And additionally, this example requires a stack to be present. This might cause problems in only a few cases.
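
To make the flags point concrete, here is a small sketch for 32-bit code. The stc instructions and the comments are additions for illustration only, not part of the quoted examples. One detail to hedge: as far as the Intel manuals describe it, the rotate count is masked to 5 bits, so rol eax,32 ends up with a count of 0 and leaves the flags alone, while the other two rotates do rewrite CF.

```asm
use32
        stc                     ; CF = 1
        rol     eax,32          ; count masked to 5 bits gives 0: no rotation, flags untouched
        rol     ax,16           ; masked count is 16 (non-zero): AX unchanged, but CF = bit 0 of AX
        rol     ah,8            ; masked count is 8 (non-zero): AH unchanged, but CF = bit 0 of AH
        ; CF is no longer guaranteed to be 1 here, and OF is left undefined
        stc                     ; CF = 1 again
        mov     eax,eax         ; MOV does not modify any flags
        ; CF is still 1 here
```

So the register contents survive all three rotates, but code that relies on CF or OF across the padding would see a difference.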

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 08 Apr 2005, 06:27
MazeGen



Joined: 06 Oct 2003
Posts: 953
Location: Czechoslovakia

We should say what a NOP really is:

Intel Optimization Manual wrote:
Code generators generate a no-operation (NOP) to align instructions.
The NOPs are recommended for the following operations:
• 1-byte: xchg EAX, EAX
• 2-byte: mov reg, reg
• 3-byte: lea reg, 0 (reg) (8-bit displacement)
• 6-byte: lea reg, 0 (reg) (32-bit displacement)
These are all true NOPs, having no effect on the state of the machine
except to advance the EIP. Because NOPs require hardware resources to
decode and execute, use the least number of NOPs to achieve the
desired padding.


It means that instructions which have an effect on the state of the cache (JMP $+2), the flags, or the stack are not true NOPs. What you are talking about is only padding.
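
For reference, the four recommended forms could be written out like this for 32-bit code. This is only a sketch: picking EAX as the register is an assumption, and the bytes are spelled out with db so that the displacement size is fixed explicitly rather than left to the assembler.

```asm
use32
        db 90h                          ; 1 byte:  xchg eax,eax (the plain NOP)
        db 8Bh,0C0h                     ; 2 bytes: mov eax,eax
        db 8Dh,40h,00h                  ; 3 bytes: lea eax,[eax+00h], 8-bit displacement
        db 8Dh,80h,00h,00h,00h,00h      ; 6 bytes: lea eax,[eax+00000000h], 32-bit displacement
```

Each line leaves EAX, the flags, and the stack as they were; only EIP advances.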

BTW, there is no such instruction as PUSH CL.

_________________
x86asm.net
Post 08 Apr 2005, 06:56
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 6825
Location: Kraków, Poland

Here's a sample macro that does custom NOP-filling (this one is for 32-bit code):

Code:
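macro align value
 {
  ; measure the padding inside a virtual block, so nothing is emitted yet
  virtual
   align value
   ..align = $ - $$
  end virtual
  ; whole 8-byte chunks: 66 8D 04 05 00000000 = lea ax,[eax*1+00000000h]
  times ..align/8 db $66,$8D,$04,$05,$00,$00,$00,$00
  ..align = ..align mod 8
  if ..align = 7
   db $8D,$04,$05,$00,$00,$00,$00 ; lea eax,[eax*1+00000000h]
  else if ..align = 6
   db $8D,$80,$00,$00,$00,$00 ; lea eax,[eax+00000000h]
  else if ..align = 5
   db $66,$8D,$54,$22,$00 ; lea dx,[edx+00h]
  else if ..align = 4
   db $8D,$44,$20,$00 ; lea eax,[eax+00h] with SIB byte
  else if ..align = 3
   db $8D,$40,$00 ; lea eax,[eax+00h]
  else if ..align = 2
   db $8B,$C0 ; mov eax,eax
  else if ..align = 1
   db 90h ; nop
  end if
 }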
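
A possible way to try the macro is sketched below. The labels, the org 0 base, and the loop body are made up for illustration. With the macro defined earlier in the same file, the 14 padding bytes should come out as one 8-byte filler followed by the 6-byte LEA form; without it, fasm's built-in align pads the same 14 bytes with 90h.

```asm
use32
org 0
start:
        xor     eax,eax         ; 2 bytes, so the next address is 2
        align   16              ; 14 bytes of padding are needed to reach address 16
copy_loop:                      ; hypothetical hot loop, now at address 16
        inc     eax
        cmp     eax,100
        jb      copy_loop
        ret
```

Comparing the two binaries in a hex viewer is a quick way to confirm which filler was emitted.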

Post 08 Apr 2005, 17:16
MazeGen



Joined: 06 Oct 2003
Posts: 953
Location: Czechoslovakia

Nice.

668D542200 LEA DX,[EDX]

may cause a partial register stall. Better would be

3E8D542200 LEA EDX,DS:[EDX]

(or whatever segment register you want)
Post 09 Apr 2005, 06:39
Tomasz Grysztar
Assembly Artist


Joined: 16 Jun 2003
Posts: 6825
Location: Kraków, Poland

equivalent for GNU AS .balign

BTW, someone has just asked me by e-mail how to implement the equivalent of the .balign directive used in GNU AS. I'm also posting here the macro I made in a quick reply, since it demonstrates once again the method I've shown above, which you can use to heavily customize fasm's alignment capabilities:

Code:
macro .balign value,fill,limit
{
 virtual
  align value
  ..align = $ - $$
 end virtual
 if ..align <= limit+0 | limit eq
  if fill eq
   rb ..align
  else
   times ..align db fill
  end if
 end if
}
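
As a hedged usage sketch, assuming the .balign macro above has been defined earlier in the same file (the data bytes, the org 0 base, and the fill values are arbitrary choices for illustration):

```asm
use32
org 0
        db      1,2,3           ; 3 bytes of data, so the current address is 3
        .balign 8,0CCh          ; no limit given: emits five 0CCh bytes, up to address 8
        .balign 16,90h,4        ; 8 bytes would be needed, more than the limit of 4: nothing is emitted
        db      0FFh            ; so this byte lands at address 8
```

The second call shows the limit argument at work: when the required padding exceeds the limit, the macro emits nothing and the address stays where it was.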

Post 09 Apr 2005, 18:33
Adam Kachwalla



Joined: 01 Apr 2006
Posts: 150

Just want to know; why would anyone use a NOP? It wastes some 20-200 clock cycles, and some extra space as well! (Maybe that's why Longhorn takes up 15GB)
Post 23 May 2007, 09:20
f0dder



Joined: 19 Feb 2004
Posts: 3172
Location: Denmark

Heh, NOP wasting 20-200 clock cycles?

Answer is of course that aligning some loop labels can improve performance, just like aligning data can improve performance.
Post 23 May 2007, 11:57
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7109
Location: Slovakia

but all those precious bits wasted!!!
Post 23 May 2007, 12:22
ack



Joined: 06 Aug 2008
Posts: 1

BTW ...

0x93 = XCHG BX,AX
0x92 = XCHG DX,AX
0x91 = XCHG CX,AX
0x90 = XCHG AX,AX = NOP
Post 06 Aug 2008, 01:52
Madis731



Joined: 25 Sep 2003
Posts: 2146
Location: Estonia

Erm, NOP takes 0.5 clocks on Pentium and later.
It takes 0.333 clocks on Pentium III and later, and
0.25-0.333 clocks (depending on how you schedule) on the Core architecture and later.
So even the maximum of 15 bytes of padding takes 5 clocks at most!

I know, Adam Kachwalla, that you weren't all that serious, but this is a clarification for other people who might not have understood the joke.

PS. Actually, you can use the "wasted" bits to compose loop headers with better efficiency.
Post 06 Aug 2008, 12:18
rugxulo



Joined: 09 Aug 2005
Posts: 2279
Location: Usono (aka, USA)

IIRC, old Turbo C uses xchg bx,bx for a 2-byte NOP.
Post 21 Aug 2008, 02:09
asmcoder



Joined: 02 Jun 2008
Posts: 785

[content deleted]

[content deleted]


Last edited by asmcoder on 14 Aug 2009, 14:48; edited 1 time in total
Post 21 Jul 2009, 17:25
Azu



Joined: 16 Dec 2008
Posts: 1160


f0dder wrote:
Heh, NOP wasting 20-200 clock cycles?

Answer is of course that aligning some loop labels can improve performance, just like aligning data can improve performance.

Why not use that space to store something that doesn't need to be aligned, and just jump over it? Wouldn't you still have the performance benefit of alignment then, without taking more space?



asmcoder wrote:
nop
its ridiculous do do that your doing.

Somebody translate please.
Post 24 Jul 2009, 08:43
MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany

Aligning the beginning of a loop's code to the cache line size of the L1 instruction cache really can increase its execution speed, but only if you don't overdo it: aligning every loop in a big program this way can easily cause capacity misses in the L1 instruction cache (overloading it).

So, you should only align the beginning of loops that are actually executed frequently and that contribute a significant amount to the total execution time of a process or task.

And I had the unpleasant honour (gcc optimization is just a f++king messy war of compiler switches, it has over 600 of them!) of optimizing some gcc settings for some C code by changing the alignment of some loops. Then I noticed that gcc even supports aligning the code blocks of conditional "if" and "case" statements. It turns out that the effect of the latter is negligible or even adverse over 95% of the time and yields hardly any improvement at all.
Post 11 Jun 2011, 14:08
Copyright © 2004-2018, Tomasz Grysztar.
Powered by rwasa.