flat assembler
Message board for the users of flat assembler.
Compiler Internals > More-than-one-byte NOPs
-Id- 07 Apr 2005, 14:31
I've noticed that when you use the align n pseudoinstruction, the compiler always does the padding with NOP (opcode 90h) instructions.
A feature that could be implemented for the next version of FASM is the use of more-than-one-byte "nop" instructions (whichever combination the compiler likes more). Those instructions appear in the AMD Athlon optimization guide, page 59. Here is the link for the document: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
THEWizardGenius 07 Apr 2005, 15:22
Here's a two-byte "NOP":
    jmp $+2
It simply jumps over itself, to the next instruction. There are other types of NOPs, check out this URL: http://www.df.lth.se/~john_e/gems/gem0008.html
It has 3-byte, 4-byte, 5-byte, and 6-byte NOPs. Of course, remember you can do things like this to get a longer NOP (I think this only works on 386+, but it's not likely to matter; most people have at least a 386):
    rol eax,32
    rol ax,16
    rol ah,8
Or a REALLY long "NOP" (which works on older CPUs):
    push cl
    mov cl,16
    rol ax,cl
    pop cl
Just try to think of different things which do absolutely nothing, or which have no effect on anything:
    push ax
    pop ax
This is very simple, and there are many possibilities. You can create really long NOPs if you want, or just put several short NOPs in a row to get a bigger one. Do whatever works.
_________________
FASM Rules! OS Dev is fun! Pepsi tastes nasty! Some ants toot! It's over!
MazeGen 07 Apr 2005, 16:55
Yeah, also Intel Optimization Manual, chapter General Optimization Guidelines, Miscellaneous, NOPs, is very clear on this topic.
THEWizardGenius 07 Apr 2005, 20:19
Good point. I'm not saying that is a good idea, I'm just saying that's one way to do it. NOP is still the best, but if you ever need something else (for example, if you have a project where for some reason you aren't allowed to use NOP) it can work, despite being slower. You definitely shouldn't use that kind of thing in a high-performance application or game; I am simply being creative.
Remember, NOP (90h) can be repeated, so if you really need a several-byte NOP, you can repeat 90h rather than use a more complex type of NOP.
Let me remind you, Reverend, that occasionally NOP is used for delays when programming hardware directly, since the CPU is faster than the hardware - and even NOP (or JMP $+2) can take enough time sometimes for the delay. This is one situation where my examples may be better, but "jmp $+2" is still the best 2-byte NOP.
I have to admit, -Id-, I don't know EXACTLY what you meant by what you said, I was just responding to the general idea of multiple-byte NOPs.
BTW, I think FASM at least tries, as far as possible, to stick to Intel standards. So most of the time, something that doesn't exist on Pentium will not be included. So if you are speaking of a multi-byte NOP (what's the use of that, anyways?) that only AMD processors support, I don't think FASM will ever use it. However, the more important AMD technologies FASM does usually use, I think.
MazeGen 07 Apr 2005, 22:00
Quote:
Well, there has to be some reason why Intel recommends a multi-byte NOP instead of repeated NOP (90h). I guess it has something to do with the decoder: decoding one instruction is probably a bit faster than decoding 6 instructions (like LEA EAX,[EAX+00000000] instead of 6 NOPs).
Quote:
Such an instruction may reload the cache, so it is not a true NOP. Intel recommends MOV reg1, reg1 instead.
MCD 08 Apr 2005, 06:27
THEWizardGenius wrote:
This one is a good NOP example.
THEWizardGenius wrote:
Actually, these ain't NOPs, 'cause they modify the flags! This should be taken into account.
THEWizardGenius wrote:
The same is true for this example. And additionally, this example requires a stack to be present. This might cause problems in only a few cases.
_________________
MCD - the inevitable return of the Mad Computer Doggy
MazeGen 08 Apr 2005, 06:56
We should say what the NOP really is:
Intel Optimization Manual wrote: Code generators generate a no-operation (NOP) to align instructions.
It means that instructions which have an effect on the state of the cache (JMP $+2), the flags, or the stack are not true NOPs. You are speaking only about some stuffing.
BTW, there is no such instruction as PUSH CL.
Tomasz Grysztar 08 Apr 2005, 17:16
Here's a sample macro that does the custom NOP-filling (this one is for 32-bit code):
Code:
macro align value
{
  virtual
    align value
    ..align = $ - $$                      ; number of padding bytes needed
  end virtual
  times ..align/8 db $66, $8D, $04, $05, $00, $00, $00, $00  ; 8 bytes: lea ax,[eax*1+0]
  ..align = ..align mod 8
  if ..align = 7
    db $8D, $04, $05, $00, $00, $00, $00  ; lea eax,[eax*1+0]
  else if ..align = 6
    db $8D, $80, $00, $00, $00, $00       ; lea eax,[eax+0], 32-bit displacement
  else if ..align = 5
    db $66, $8D, $54, $22, $00            ; lea dx,[edx+0]
  else if ..align = 4
    db $8D, $44, $20, $00                 ; lea eax,[eax+0], with SIB byte
  else if ..align = 3
    db $8D, $40, $00                      ; lea eax,[eax+0], 8-bit displacement
  else if ..align = 2
    db $8B, $C0                           ; mov eax,eax
  else if ..align = 1
    db 90h                                ; nop
  end if
}
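For anyone who wants to see what the macro emits without assembling anything, here is a rough Python model of the same selection logic. It is only a sketch: the names `FILLERS` and `nop_fill` are made up for illustration and are not part of fasm, and the byte sequences are simply copied from the macro (32-bit code only).

```python
# Hypothetical model of the align macro above; not part of fasm itself.
# Byte sequences are copied from the macro (32-bit code only).
FILLERS = {
    1: bytes([0x90]),                                      # nop
    2: bytes([0x8B, 0xC0]),                                # mov eax,eax
    3: bytes([0x8D, 0x40, 0x00]),                          # lea eax,[eax+0], disp8
    4: bytes([0x8D, 0x44, 0x20, 0x00]),                    # same, with SIB byte
    5: bytes([0x66, 0x8D, 0x54, 0x22, 0x00]),              # lea dx,[edx+0]
    6: bytes([0x8D, 0x80, 0x00, 0x00, 0x00, 0x00]),        # lea eax,[eax+0], disp32
    7: bytes([0x8D, 0x04, 0x05, 0x00, 0x00, 0x00, 0x00]),  # lea eax,[eax*1+0]
    8: bytes([0x66, 0x8D, 0x04, 0x05, 0x00, 0x00, 0x00, 0x00]),
}


def nop_fill(offset, alignment):
    """Return the filler bytes that move `offset` up to the next boundary."""
    pad = -offset % alignment           # same value the macro calls ..align
    fill = FILLERS[8] * (pad // 8)      # as many 8-byte fillers as fit
    if pad % 8:
        fill += FILLERS[pad % 8]        # one shorter filler for the rest
    return fill


if __name__ == "__main__":
    # Code currently at address 13, aligning to 16: three bytes of padding.
    print(nop_fill(13, 16).hex())
```

With an offset of 13 and an alignment of 16 the model picks the single 3-byte LEA rather than three separate 90h bytes, which is the whole point of the macro.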
MazeGen 09 Apr 2005, 06:39
Nice
668D542200 LEA DX,[EDX] may cause a partial register stall. Better would be 3E8D542200 LEA EDX,DS:[EDX] (or whatever segment register you want).
Tomasz Grysztar 09 Apr 2005, 18:33
BTW, someone has just asked me by e-mail how to implement the equivalent of the .balign directive used in GNU AS. I'm also posting here the macro I made in a quick reply, since it demonstrates once again the method I've shown above, which you can use to highly customize fasm's alignment capabilities:
Code:
macro .balign value,fill,limit
{
  virtual
    align value
    ..align = $ - $$
  end virtual
  if ..align <= limit+0 | limit eq
    if fill eq
      rb ..align
    else
      times ..align db fill
    end if
  end if
}
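The three cases the macro handles (no fill given, fill given, padding over the limit) can be checked quickly with a small Python model. This is a sketch under assumptions: `balign_fill` is a hypothetical name, and the bytes reserved by `rb` are modelled as zeros here, although fasm treats them as uninitialized.

```python
def balign_fill(offset, alignment, fill=None, limit=None):
    """Model of the .balign macro: return the padding bytes, or b'' if skipped."""
    pad = -offset % alignment          # same value the macro calls ..align
    if limit is not None and pad > limit:
        return b""                     # padding would exceed the limit: emit nothing
    if fill is None:
        return bytes(pad)              # "rb ..align", modelled here as zero bytes
    return bytes([fill]) * pad         # "times ..align db fill"


if __name__ == "__main__":
    print(balign_fill(5, 8, 0x90))           # three 90h bytes
    print(balign_fill(5, 8, 0x90, limit=2))  # empty: 3 bytes needed, limit is 2
```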
Adam Kachwalla 23 May 2007, 09:20
Just want to know; why would anyone use a NOP? It wastes some 20-200 clock cycles, and some extra space as well! (Maybe that's why Longhorn takes up 15GB)
f0dder 23 May 2007, 11:57
Heh, NOP wasting 20-200 clock cycles?
The answer is, of course, that aligning some loop labels can improve performance, just like aligning data can improve performance.
vid 23 May 2007, 12:22
but all those precious bits wasted!!!
ack 06 Aug 2008, 01:52
BTW ...
0x93 = XCHG BX,AX
0x92 = XCHG DX,AX
0x91 = XCHG CX,AX
0x90 = XCHG AX,AX = NOP
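The pattern in that list is the one-byte XCHG encoding, 90h plus the register number. A short Python sketch reproduces it; the helper name `xchg_ax_opcode` is made up for illustration.

```python
# Register numbering used by the one-byte XCHG r16,AX encoding (90h + r).
REGS16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def xchg_ax_opcode(reg):
    """Return the one-byte opcode for XCHG reg,AX."""
    return 0x90 + REGS16.index(reg.upper())


if __name__ == "__main__":
    for reg in REGS16:
        print(f"0x{xchg_ax_opcode(reg):02X} = XCHG {reg},AX")
```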
Madis731 06 Aug 2008, 12:18
Erm, NOP takes 0.5 clocks from Pentium and later.
It takes 0.333 clocks from Pentium III and later AND it takes 0.25-0.333 clocks (depending on how you schedule) from Core arch. and later. So the maximum needed 15-byte alignment takes 5 clocks at most!!!
I know, Adam Kachwalla, that you weren't all that serious, but this is a clarification for other ppl who might not have understood the joke.
PS. Actually you can use the "wasted" bits to compose loop headers with better efficiency.
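The arithmetic behind the "5 clocks" figure is just padding bytes times clocks per NOP. A tiny Python sketch, using the per-NOP figures quoted in the post as given (they are the poster's numbers, not measurements made here) and a hypothetical helper name:

```python
from fractions import Fraction


def nop_padding_clocks(pad_bytes, clocks_per_nop):
    """Clocks spent executing `pad_bytes` single-byte NOPs."""
    return pad_bytes * clocks_per_nop


if __name__ == "__main__":
    # Worst case for 16-byte alignment is 15 bytes of padding.
    print(nop_padding_clocks(15, Fraction(1, 2)))   # figure quoted for Pentium
    print(nop_padding_clocks(15, Fraction(1, 3)))   # figure quoted for Pentium III
    print(nop_padding_clocks(15, Fraction(1, 4)))   # best case quoted for Core
```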
rugxulo 21 Aug 2008, 02:09
IIRC, old Turbo C uses xchg bx,bx for a 2-byte NOP.
asmcoder 21 Jul 2009, 17:25
[content deleted]
Last edited by asmcoder on 14 Aug 2009, 14:48; edited 1 time in total
Azu 24 Jul 2009, 08:43
f0dder wrote: Heh, NOP wasting 20-200 clock cycles?
asmcoder wrote: nop
Somebody translate please.
MCD 11 Jun 2011, 14:08
Aligning the beginning of the code of a loop to the cache line size of the L1 instruction cache really can increase its execution speed, but only if you don't overdo it, since you can easily cause capacity misses in the L1 instruction cache (overloading it) when aligning every loop in a big program this way.
So, you should only align the beginning of the code of loops that are actually looped over frequently and that contribute a significant amount of time to the total execution time of a process or task.
And I had the unpleasant honour (gcc optimization is just a f++king messy war of compiler switches, it has over 600 of them!) to optimize some gcc settings for some C code by changing the alignment of some loops. And then I noticed that gcc even supports aligning the code blocks of conditional "if" and "case" statements. It turns out that the effect of the latter is negligible or even adverse over 95% of the time and yields hardly any improvement at all.
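To get a feel for how quickly the padding adds up, here is a rough Python sketch that lays out a few code blocks with and without alignment. It is an illustration only: `laid_out_size` is a hypothetical helper, the block sizes are invented, and the 64-byte line size is an assumption that differs between CPUs.

```python
def laid_out_size(block_sizes, alignment=1):
    """Total code size when every block starts on an `alignment`-byte boundary."""
    pos = 0
    for size in block_sizes:
        pos += -pos % alignment   # padding inserted before the block
        pos += size               # the block itself
    return pos


if __name__ == "__main__":
    blocks = [10, 20, 30]                 # invented block sizes, in bytes
    print(laid_out_size(blocks))          # packed back to back
    print(laid_out_size(blocks, 64))      # each block on an assumed 64-byte line
```

Three small blocks that fit in 60 bytes when packed spread over three cache lines once each one is aligned, which is the capacity cost described above.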
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.