flat assembler
Message board for the users of flat assembler.
"mov eax,0" or "xor eax,eax"
LocoDelAssembly 06 Dec 2005, 02:33
See http://www.agner.org/assem/pentopt.pdf, chapter "15.10 Breaking dependences".
Madis731 06 Dec 2005, 08:19
There has been a debate on that on these boards before, and we came to the conclusion that the best choice speed-wise and size-wise is AND EAX,00h, because XOR EAX,EAX is more complicated and MOV EAX,00000000h is too long.
When you read Agner's manual you should note that SUB EAX,EAX is bad because it is not bit-independent: in the worst case a one-bit overflow carries through all 31 other bits. That is not the case with XOR.
EDIT: NB! There are special cases, though, where AND EAX,00h crosses a DWORD fetch boundary while XOR EAX,EAX doesn't. So the conclusion is not final: AND is 3 bytes, where XOR is 2 bytes. Sorry for the typo.
Last edited by Madis731 on 06 Dec 2005, 12:39; edited 1 time in total
MazeGen 06 Dec 2005, 09:45
What about this?
MazeGen 06 Dec 2005, 10:05
Madis, I wonder how XOR r, r can be more complicated than AND r, imm?
decard 06 Dec 2005, 12:15
MazeGen: so, if I understood it correctly, xor is better for P4 and above?
Madis731 06 Dec 2005, 12:44
http://www.play-hookey.com/digital/images/xorn-01.gif
Four NOT-AND (NAND) gates are used, while AND uses 1 or 2. That is because AND can be derived from regular switches, two or more in a row, but XOR logic is very counterintuitive to the natural human brain. You could think of it as a carryless adder. And when you compare XOR with an adder, adders are a lot more complicated than AND.
...and MazeGen, about the thread you posted: at the very end I put some test results from a CLI / 4 Giga loops / STI test case.
Code:
times_4294967295:
        XOR EAX,EAX     ;111124clk 100.0% xor's got a very tricky logic
        SUB EAX,EAX     ;109257clk  98.3% carries make a long dependency
        MOV EAX,0       ;105052clk  94.5% too much memory overhead
        AND EAX,0       ; 85680clk  77.1% and's got the sweetest logic
I really hoped the XOR would be at least as fast as SUB, but wow...
vid 06 Dec 2005, 13:31
madis: could you post the entire code you tested it on?
tom tobias 06 Dec 2005, 16:21
Madis731 wrote: ...
I suppose that ENTER, LEAVE, and PUSHA modify EBP. However, if one is not using those instructions, then an alternative that eliminates the heavy burden of "memory overhead" associated with MOV EAX,0, yet retains the spirit of writing PROGRAMS instead of CODE, would be to assign EBP the value 0 (initialization) and thereafter use EBP as a CONSTANT, thus:
        MOV EAX,EBP     ; remember, EBP is always equal to zero
I guess that operation would then be just as fast as xor eax,eax, though for me personally the penalty of obscurity and lack of readability with mov eax,ebp renders this solution useless. I prefer to pay the penalty, SLOWER but easier to read: MOV EAX, ZERO.
Pity that the Intel architecture has such a paucity of registers. However, I am surprised to learn that there is such a severe penalty (5.5% slower) for constants sitting in cache. Thank you, Madis, for your excellent travail!
Madis731 06 Dec 2005, 17:32
I downloaded an IOPL module somewhere, but I don't remember anymore where from. I've also lost the code I tested it with because I didn't think it was necessary :S. I tried writing another one and succeeded in arguing with myself. The platform difference shouldn't matter (the laptop I'm using is a 700MHz Pentium III Coppermine-T, while last time I used a 2.66GHz Pentium 4 Northwood).
And damn, these clocks were normalized values, so I can't even calculate the real clock rate :@. Hmm, let's just discard them...
Code:
;::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
; Calculating cycles -
; by Edgar Barbosa, a.k.a Opcode
;::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
include "win32ax.inc"

start:
        int 0edh
        nop
        cli                     ; no interrupts while measuring
        rdtsc
        mov [dat], eax          ; start count (low 32 bits)
        nop
;;;;;;;;;;;;;;[code]
        mov ecx,100000000
.Calc8:
        mov eax,0FEDCBA98h
        xor eax,eax
        mov eax,0FFFFFFFFh
        xor eax,eax
        mov eax,055555555h
        xor eax,eax
        mov eax,076543210h
        xor eax,eax
        ;XOR = 600000003 or 600000007 clocks
        ;SUB = 600000003 or 600000007 clocks
        ;AND = 500000004 or 500000008 clocks
        ;MOV = 500000010 or 500000014 clocks
        sub ecx,1
        jnz .Calc8
;;;;;;;;;;;;;;[/code]
        rdtsc
        sub eax, [dat]          ; elapsed count
        sub eax,189
        sti
        nop
        cinvoke wsprintf, dat, "%020d clock cycles", eax
        invoke MessageBox, NULL, dat, "Opcode IOPL hack", MB_OK ; show the formatted string
        invoke ExitProcess, 0

data import
  library kernel,'KERNEL32.DLL',user,'USER32.DLL'
  import kernel,ExitProcess,'ExitProcess'
  import user,wsprintf,'wsprintfA',MessageBox,'MessageBoxA'
  dat dd 0,0,0,0 , 0,0,0,0,0
end data
Another type of inner loop:
Code:
        mov ecx,100000000
.Calc8:
        mov eax,0
        mov eax,0
        mov eax,0
        mov eax,0
        ;XOR = 466666616 or 466666620 clocks
        ;SUB = 466666616 or 466666620 clocks
        ;AND = 444680876 or 444680880 clocks
        ;MOV = 300000005 or 300000009 clocks
        sub ecx,1
        jnz .Calc8
P.S. tom tobias: I love your sarcasm, but for this 'travail' I had to open up a dictionary. Why couldn't you just say 'hard work' to us non-native English speaking guys here?
paucity = smallness, fewness
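For readers without the IOPL module, a similar loop can be timed from user mode. What follows is only a sketch under stated assumptions: 32-bit Windows, fasm with its standard win32a.inc, and printf imported from MSVCRT.DLL for output. The names t0, fmt and ITERATIONS are made up for this example and do not come from Edgar Barbosa's code. Interrupts stay enabled, so the numbers will be noisier than with the CLI/STI version, and on newer processors the time-stamp counter may tick at a fixed rate rather than at the core clock.

```asm
; User-mode timing sketch (assumptions: 32-bit Windows, fasm, MSVCRT printf).
format PE console
entry start

include 'win32a.inc'

ITERATIONS = 100000000          ; same loop count as in the post

section '.text' code readable executable

start:
        xor     eax, eax
        cpuid                   ; serialize before reading the counter
        rdtsc
        mov     [t0], eax       ; keep only the low 32 bits of the counter

        mov     ecx, ITERATIONS
.calc:
        mov     eax, 0FEDCBA98h
        xor     eax, eax        ; instruction under test
        mov     eax, 0FFFFFFFFh
        xor     eax, eax
        mov     eax, 055555555h
        xor     eax, eax
        mov     eax, 076543210h
        xor     eax, eax
        sub     ecx, 1
        jnz     .calc

        rdtsc
        sub     eax, [t0]       ; elapsed count, valid while it is below 2^32
        cinvoke printf, fmt, eax
        invoke  ExitProcess, 0

section '.data' data readable writeable

t0      dd 0
fmt     db '%u clock cycles (low 32 bits of the TSC delta)', 13, 10, 0

section '.idata' import data readable writeable

library kernel32, 'KERNEL32.DLL', \
        msvcrt,   'MSVCRT.DLL'
import  kernel32, ExitProcess, 'ExitProcess'
import  msvcrt,   printf, 'printf'
```

Dividing the reported count by the 100000000 iterations gives clocks per pass. For the figures quoted in the post, 600000003 / 100000000 is about 6 clocks for the eight-instruction body plus loop overhead with XOR, and 500000004 / 100000000 is about 5 with AND.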
decard 06 Dec 2005, 17:41
That's better. If he uses more uncommon words, it will make you check them in a dictionary; this way you will learn a new phrase. And I'm sure you will remember it. (sorry for getting offtopic)
Madis731 06 Dec 2005, 18:09
...and here's another inner loop. This is how I got results that XOR is better than MOV:
Code:
;;;;;;;;;;;;;;[code]
rept 1000 {mov eax,0}    ;13497,13423,11106,12109,13325 | 5 consecutive tests
;rept 1000 {and eax,0}   ;7015,6777,6842,7008,7687
;rept 1000 {sub eax,eax} ;5098,4629,4685,4937,4650
;rept 1000 {xor eax,eax} ;4461,4419,4873,4755,4629
;;;;;;;;;;;;;;[/code]
As you can see, XOR takes first place on a PIII, but AND was the best on a PIV, so it all depends on the pipeline, cache, etc.
tom tobias 06 Dec 2005, 18:38
Madis731 wrote: ... I love your sarcasm but for this 'travail' I had to open up a dictionary. Why couldn't you just say 'hard work' to us non-native English speaking guys here
Lots here:
1. Your effort was and IS very informative, detailed, thorough, readable, and INTERESTING. To me, this effort GOES WAY BEYOND mere "hard working", and in English, when we wish to laud someone's effort, we move away from our Germanic roots and substitute the Latin equivalent (i.e. FRENCH) as indicative of TRULY HIGH ACCOMPLISHMENT. Since 1066, French, not English, has been the language of choice for signifying praiseworthiness among native English speakers (even if we are capable of neither speaking nor writing it!!!). I think this is also true in MOST of Eastern Europe, especially Poland and Russia, countries with the bulk of the membership of the FASM forum.
2. OK, there was a tiny bit of tongue in cheek, but really, sincerely, I DO ENJOY reading your contributions, and felt that simply labelling your submissions to this thread "hard work" demeaned your labor. You have given us some actual data. Terrific!
3. What about MOV eax, ebp? Is it as fast as, faster than, or slower than xor eax,eax?
Sincerely, without ANY sarcasm.
Madis731 06 Dec 2005, 18:58
Hi,
First, I'm sorry that I misunderstood you. Second, I did test with EBP, but the "without interrupts" part is very dangerous. Tests with more than 1000 iterations ended in a BSOD and a restart. Tests with mov ebp,0 \ mov eax,ebp also ended with a BSOD.
Code:
        push ebp
        xor ebp,ebp
        rept 1000 {mov eax,ebp} ;4179,4886,5061,4620,4606
        ;Note! These are not comparable with previous ones because of the overhead
        pop ebp
Don't program interruptless code unless you know what you are doing. Maybe my CPU (Pentium III) has some kind of mechanism to detect infinite loops and stops it by reboot or halt. Maybe... 1000 times the same instruction is called optimization, but 10000 times the same sequence is definitely not logical code.
vid 06 Dec 2005, 19:37
what's that int ED? maybe you could write an article on it, if it is something hacky-cracky interesting
Madis731 07 Dec 2005, 12:34
I think you should talk with Edgar Barbosa about this. I think the loaded .sys-file defines some interrupt at this address. Haven't seen the source - just using the binaries.
EDIT: Did some diggin' on the net and voila: http://win.asmcommunity.net/board/index.php?topic=18859.0
MazeGen 07 Dec 2005, 16:05
decard wrote: MazeGen: so, if I understood it correctly, xor is better for P4 and above?
According to those numbers, MOV is A BIT faster, but in the context of dependencies, XOR may be faster.
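To make the dependency point concrete, here is a small fragment in the spirit of the "Breaking dependences" chapter linked at the top of the thread. It is an illustration only: ESI is assumed to point at a readable byte, and how a given processor schedules each variant is something to check against Agner Fog's manual, not something this thread measured. The fragment assembles with fasm as a flat binary, which is enough to inspect the encodings.

```asm
use32

; Variant A: clear with XOR, then write only the low byte.
        xor     eax, eax        ; EAX = 0 whatever it held before; also rewrites the flags
        mov     al, [esi]       ; partial write: bits 8..31 stay zero
        add     edx, eax        ; full 32-bit read of EAX

; Variant B: clear with MOV instead.
        mov     eax, 0          ; EAX = 0, flags untouched, 5 bytes instead of 2
        mov     al, [esi]       ; same partial write
        add     edx, eax        ; same full 32-bit read
```

Both variants leave the zero-extended byte in EAX. They differ in code size and in what happens to the flags, and possibly in how the processor tracks the partial write to AL, which is the part that varies by CPU generation.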
El Tangas 11 Dec 2005, 21:18
Either instruction can be faster; the important thing is not to cross cache boundaries. If you test an endless repeat of the same instruction, this will favour the shorter instructions (xor and sub), because the longer ones cross more cache boundaries. If you test a loop, results may be different.
Now, what about the worst way to clear a register? I propose shr reg,32.
Just a note: in the Athlon series, mov reg, imm instructions are "executed" by the decoder, so they act like NOPs and don't take execution resources.
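One way to keep a measured loop from straddling a fetch or cache-line boundary is to align its entry point. A minimal sketch, assuming fasm syntax and a 16-byte boundary chosen only for illustration (the boundary that matters depends on the processor); the label clear_loop is hypothetical.

```asm
use32
        mov     ecx, 1000
align 16                        ; pad up to the next 16-byte boundary
clear_loop:
        xor     eax, eax        ; 2 bytes
        sub     ecx, 1          ; 3 bytes
        jnz     clear_loop      ; 2 bytes (short jump)
```

The loop body is 7 bytes, so once its first byte sits on a 16-byte boundary the whole loop fits inside one 16-byte block.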
MazeGen 11 Dec 2005, 21:26
Heh, you can't do SHL/R with 32.
The worst way? What about IMUL reg,reg,0?
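The objection is presumably about how the processor treats the count: for 32-bit operands the shift count is masked to its low 5 bits, so a count of 32 behaves like 0. A short fragment to step through in a debugger, assuming fasm accepts the out-of-range count as an ordinary imm8:

```asm
use32
        mov     eax, 0FFFFFFFFh
        shl     eax, 32         ; 32 and 31 = 0, so EAX is still 0FFFFFFFFh
        shl     eax, 31         ; EAX = 80000000h
        shl     eax, 1          ; EAX = 0: two shifts totalling 32 do clear it
```

By the same rule a count of 255 acts as a count of 31.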
Madis731 11 Dec 2005, 22:09
There are endless possibilities like:
Code:
test_zero:
        sub eax,1
        cmp eax,0
        jne test_zero
;-OR-
        lea eax,[0]     ; That is 6 bytes: 8D 05 00 00 00 00
P.S. and yes, you can do:
        shl eax,32
        shr eax,32      ; but I don't remember which one of them was optimized
The requirement is imm8, so you can even do SHR EBP,255.
EDIT: Hmm, okay, here's a summary:
Code:
33C0                XOR EAX, EAX      ;Variants for 2-byte resets
2BC0                SUB EAX, EAX
83E0 00             AND EAX, 0        ;Many possibilities for 3-byte resets
C1E0 20             SHL EAX, 32
C1E8 20             SHR EAX, 32
6BC0 00             IMUL EAX, EAX, 0
B8 00000000         MOV EAX, 0        ;There are no 4-byte ones but there is one 5-byte reset and
8D05 00000000       LEA EAX, [0]      ; a 6-byte one.
                                      ; An exception here:
6B05 00000000 00    IMUL EAX, [0], 0  ; A 7-byte one but you must have a read-accessible
                                      ; memory-address here
It's getting interesting
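The sizes in the summary can be checked by the assembler itself. This sketch assumes a fasm version that provides the assert directive and that fasm picks the short imm8 encodings shown in the listing; the labels are hypothetical. If any length differs, assembly stops with an error instead of producing output.

```asm
use32

s_xor:  xor     eax, eax
s_sub:  sub     eax, eax
s_and:  and     eax, 0
s_imul: imul    eax, eax, 0
s_mov:  mov     eax, 0
s_lea:  lea     eax, [0]
s_end:

; Each difference is the encoded length of the instruction at the first label.
assert s_sub  - s_xor  = 2
assert s_and  - s_sub  = 2
assert s_imul - s_and  = 3
assert s_mov  - s_imul = 3
assert s_lea  - s_mov  = 5
assert s_end  - s_lea  = 6
```

The check covers instruction length only; it says nothing about speed or about whether the instruction actually leaves zero in the register.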
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.