flat assembler
Message board for the users of flat assembler.

"mov eax,0" or "xor eax,eax"

Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 7725
Location: Kraków, Poland
Actually, I think all of you are right (and I notice more than two sides in this discussion). ;) What I wrote was not an attempt at polemics; I was just explaining my private and personal programming habits, so you'd know why such constructions are used in fasm's sources.
Post 13 Dec 2005, 19:30
RedGhost



Joined: 18 May 2005
Posts: 443
Location: BC, Canada
decard wrote:
Me too ;) Xoring is natural for me now, just like using "or" to test whether a register is zero. In the Fresh source, John Found always used "test" for that purpose. I just wonder if there's any speed difference between "or" and "test"?

Yeah, I have also come to use 'or' to set the flags for jnz/jz, and I use that (or other tricks) more than I use cmp these days. :x

_________________
redghost.ca
Post 15 Dec 2005, 11:10
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
TEST is preferred over OR/AND because OR/AND allocate an internal CPU resource for the writeback to the register/memory as well as the writeback to the flags. TEST only reads the register/memory and writes to the flags.

On a superscalar architecture this is an important distinction, and it helps improve speed.
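
To make the comparison concrete, here is a minimal fasm sketch of the three common ways to test a register for zero. The label names are made up for illustration, and the byte counts apply to 32-bit code; how much the difference matters in practice depends on the CPU.

```asm
use32                           ; 32-bit code, flat binary output

zero_test_demo:                 ; hypothetical label
        test    eax, eax        ; 85 C0    - ANDs eax with itself, sets flags, writes no register
        jz      .is_zero

        or      eax, eax        ; 09 C0    - same ZF result, but eax is also written back
        jz      .is_zero

        cmp     eax, 0          ; 83 F8 00 - same ZF result, one byte longer
        jz      .is_zero

        ret
.is_zero:
        ret
```

All three leave ZF set exactly when eax is zero, so the jz that follows behaves the same in each case.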
Post 15 Dec 2005, 12:53
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
I was messing with DJGPP (GCC 3.4.3) yesterday, and when I used -S -masm=intel, I noticed a lot of TEST eax,eax and XOR ebx,ebx in the output.

I used to always use OR (compare to zero) and XOR (clear), but I switched to TEST and SUB recently, just out of curiosity! :P

[EDIT]
Okay, I upgraded to GCC 3.4.4 and then (reluctantly) to GCC 4.0.1 after getting frustrated trying to compile stupid Allegro (4.20 didn't work, but I think I got 4.03 compiled with GCC 3.2.3). I must say, newer versions of GCC do seem to break a lot of stuff (almost as many flaws as improvements). :roll:
[/EDIT]
Post 15 Dec 2005, 19:02
El Tangas



Joined: 11 Oct 2003
Posts: 120
Location: Sunset Empire
I had actually never noticed that the count in shift instructions is taken mod 32 :oops: . I guess shl/shr reg,32 (or 64, 96, etc.) are just expensive nops...
How does it work in 64-bit mode, btw?
Post 17 Dec 2005, 19:49
MazeGen



Joined: 06 Oct 2003
Posts: 975
Location: Czechoslovakia
El Tangas wrote:
How does it work in 64-bit mode, btw?

For a 64-bit operand, it is mod 64 :)
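
A small fasm sketch of the count masking described above, for 64-bit mode. The constants are arbitrary values chosen for illustration; the comments give the register contents after each shift.

```asm
use64                           ; 64-bit mode, flat binary output

        mov     eax, 12345678h
        mov     cl, 32
        shl     eax, cl         ; 32-bit operand: count = 32 and 31 = 0, eax stays 12345678h

        mov     rax, 12345678h
        shl     rax, cl         ; 64-bit operand: count = 32 and 63 = 32, rax = 1234567800000000h

        mov     cl, 64
        shl     rax, cl         ; count = 64 and 63 = 0, rax is unchanged
        ret
```

When the masked count is 0 the flags are left untouched as well, which is why such a shift behaves like a nop.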
Post 17 Dec 2005, 19:59
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
Quote:
but switched to TEST and SUB recently just out of curiosity
Also, XOR is preferred over SUB because of the special hardware support within the CPU for breaking dependency chains. SUB reg,reg will wait for any previous results to be written before doing the SUB, but XOR reg,reg is smart and won't wait.
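
A minimal sketch of what breaking a dependency chain looks like in code. It assumes a CPU that recognises xor reg,reg as a zeroing idiom; which processors do this (and how they treat sub reg,reg) is model-specific, so take it as an illustration rather than a guarantee.

```asm
use32                           ; 32-bit code, flat binary output

        imul    eax, ebx        ; long-latency multiply produces a new eax
        xor     eax, eax        ; result is 0 whatever eax held; assumed to be treated as independent
        add     eax, ecx        ; then needs only the zeroed eax and ecx

        imul    edx, ebx
        mov     edx, 0          ; never reads the old edx either, but takes 5 bytes instead of 2
        add     edx, ecx
        ret
```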
Post 18 Dec 2005, 00:13
r22



Joined: 27 Dec 2004
Posts: 805
> On newer processors TEST is faster than OR by a decent amount.

> The shift count is taken modulo the register size: 32 for 32-bit, 64 for 64-bit. It's done this way because of the variable shift instruction SHR reg,CL, so if you're using some algorithm to get the shift count you don't have to mod CL beforehand.

> I went from programming in Visual Basic to learning assembly, and I had no problem with XOR reg,reg meaning "clear reg". I don't think we give noobs enough credit in that respect.

I like the xor edi,edx argument about it being a hard bug to find, but when you think about it logically it's a minority situation, and therefore it's a logical fallacy to use it as justification. :D

It would be like saying, "Smoking 12 cigars a day is healthy, because my uncle's cousin did it and he lived 80 years," even though there's overwhelming evidence that smoking 12 cigars a day is not healthy.

People used to use assembly because it was the only way to optimize code to a speed where it would be usable.
Now there are compilers coded by rooms of people with PhDs in math and computer science that can optimize code better than an experienced ASM coder (and if not better, then at least in a fraction of the time). I forgot the point of this, so... sorry you read it.
Post 18 Dec 2005, 05:49
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
Quote:
Now there are compilers coded by rooms of people with PhDs in math and computer science that can optimize code better than an experienced ASM coder
I find that very, very hard to believe. I have never yet seen an example of compiler-written (non-trivial) code that cannot be further optimised in assembly. Even this part you added
Quote:
(and if not better, then at least in a fraction of the time).
is not relevant, because compilers only apply simple optimisations that an experienced assembly programmer can easily do without even having to think hard. It comes automatically in a lot of cases, just like XOR reg,reg is automatic for a lot of people in this forum.

So many times the HLL compiler just doesn't understand what the desired goal is. You are forced to use the structure and commands of the particular language, which is necessarily an abstraction from the actual problem at hand. You never see a program whose author proudly states "we didn't use any assembly code because we wanted the best possible speed"; it is just not the case. Indeed, the opposite is true: you can easily find people stating that they used assembly code in the critical loop because of speed. Even Microsoft uses assembly for most of the very important and heavily used system calls; they keep different source code for each processor that they support.

Sorry to flame your comment like this, but stuff like this is always being stated as fact by HLL users who never stop to think about what they are actually saying.

Quote:
It would be like saying, "Smoking 12 cigars a day is healthy, because my uncle's cousin did it and he lived 80 years," even though there's overwhelming evidence that smoking 12 cigars a day is not healthy.
This is known as anecdotal evidence. Another thing that happens in a similar fashion is people saying "well, everyone is using/doing it, so it must be good". We should all try to stop for a short while and think about just what we are really saying: whether it is a sensible idea/comment, or whether we are just saying it because someone else said it.
Post 18 Dec 2005, 09:55
r22



Joined: 27 Dec 2004
Posts: 805
> rev, use VC++ or GCC to compile a nested for loop. With the exception of MMX and SSE instructions, HLL compilers have a solid optimization core for general-purpose asm code. Of course, the compilers can only do so much with the object-oriented hoo-ha and classes that they have to deal with, but that's the language, not the compiler's ability to optimize.

What I said holds for 32-bit programming; I haven't done enough testing with compilers on 64-bit to make an educated statement.
Post 18 Dec 2005, 21:12
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
A long time ago, I heard TEST reg,reg was better than OR reg,reg on older machines (286?).

Besides, running the following with GNU/DJGPP sed 4.1.4 should quickly change it back: :)
Code:
sed -i -e "s/[tT][eE][sS][tT] \+\(...\?\),\1/or \1,\1/g" myfile.asm

P.S. r22, if you think a certain compiler can optimize so well, at least tell us which one you're referring to!
Post 19 Dec 2005, 22:20
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Every compiler can if you let it (or if it has the right set of parameters for you to change). The problem is that where we see data structures and algorithms, the compiler only sees CODE, and depending on how much it is fed at once it can make different judgements about that bunch, and related to that, it takes time. We (human beings) can throw stuff away immediately if we can see it won't work, but the compiler brutally tries everything in spite of that.
Post 20 Dec 2005, 08:40
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Madis731 wrote:
Every compiler can if you let it (or if it has the right set of parameters for you to change). The problem is that where we see data structures and algorithms, the compiler only sees CODE, and depending on how much it is fed at once it can make different judgements about that bunch, and related to that, it takes time. We (human beings) can throw stuff away immediately if we can see it won't work, but the compiler brutally tries everything in spite of that.

True. Compilers are mindless; they simply see what WE coded and try to optimize that. All they see is code (or some particular piece of info we tell them via the source code).

If I code something for hidden surface removal (like Z-buffering or S-buffering) in 3D graphics, the compiler will just USE my algorithms and TRY to optimize them. But I (or any other human being who understands what my code is supposed to do) can improve that algorithm, because I know its purpose and I know what it will actually do. The more you abstract something away from the compiler (i.e. object-oriented, etc.), the less information it has about the REAL code it needs to produce, so less optimization... Simply put, you can't create something from nothing. :)

After all, that's why we are programmers. ;)

BTW: in the worst case, the asm programmer can simply take a look at the compiler's output and improve it, so he can ALWAYS optimize better than a compiler.

...at least until computers begin to think (AI), and then they will code everything in hexadecimal anyway. :P But that's probably not happening, because all software companies (like big M$) would become totally useless, since we could just tell the AI: build me a Wind0ze with graphics here and there... where's all the fun? :(


Anyway, back on topic: since this topic is about the fastest way to do a certain thing (at least that's what I understood from it :) ), I have to ask a few questions:

It seems xor reg, reg is the fastest way to clear a reg. What about xor ah, ah? Is it still faster than mov ah, 0, or mov ah, reg8 where reg8 is 0? Probably yes, but mov does not require any computation, does it? It just moves bits from one operand into another. If anyone knows, please tell me. Thx.

Is add reg, reg faster than shl reg, 1? Has anyone tested it? I'm not experienced in testing at the clock-cycle level (I always get wrong results :? ). And of course both should be a lot faster than imul reg, 2 :D

I always thought test reg, reg is faster than or reg, reg because it doesn't write to the register. I guess that's the case :)
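
For reference, here are the three doubling variants from the question above side by side, as a fasm sketch with their 32-bit encodings. It says nothing about timing, which depends on the processor; it only shows that all three compute the same value and how many bytes each takes.

```asm
use32                           ; 32-bit code, flat binary output

        add     eax, eax        ; 01 C0    - eax = eax * 2
        shl     eax, 1          ; D1 E0    - eax = eax * 2
        imul    eax, eax, 2     ; 6B C0 02 - eax = eax * 2, via the multiplier
        ret
```

The flags they leave behind are not identical, so they are interchangeable only when the code that follows does not depend on the flags.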


Dunno if this helps, but: the logical operators and and or are very simple to implement in a circuit, usually requiring 2 or 3 gates. xor, on the other hand, needs a few more gates, but it is faster than and eax, 0 because it doesn't need an immediate value...

add and sub are a bit more complicated in logic circuitry, because they need to handle the carry bit to see if a bit position overflowed, etc. They are not just simple logical operations (which are implemented easily via gates), because of the carry. Take these input pairs:

Code:
0 0
1 0
0 1
1 1

The logical operators (and, or, xor) only need to operate on each pair of bits, but add and sub must handle a special case. Here's add:

Code:
0 + 0 = 0
1 + 0 = 1
0 + 1 = 1
1 + 1 = 0  (carry)

That is, they are a bit more complicated because of the carry. ;)
Or am I wrong?
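
The table above can be reproduced with three instructions. This is only an illustrative fasm sketch with arbitrarily chosen registers: xor gives the per-bit sum with the carry dropped, and gives the positions that produce a carry, and add combines both.

```asm
use32                           ; 32-bit code, flat binary output

        mov     eax, 1
        mov     ebx, 1

        mov     ecx, eax
        xor     ecx, ebx        ; ecx = 0: per-bit sum, the carry is simply dropped

        mov     edx, eax
        and     edx, ebx        ; edx = 1: bit 0 is a position that generates a carry

        add     eax, ebx        ; eax = 2: the carry out of bit 0 lands in bit 1
        ret
```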

If it is like I said, why is mov eax, reg32 slower than xor eax, eax? (BTW, reg32 is 0.) :)

One last thing: can someone please tell me how I can test this properly (at the clock-cycle level)? I always run the same test multiple times; sometimes it tells me one method is faster, and the next time it is exactly reversed. Dunno if I'm doing something wrong.

Sorry to bring this thread back to life (it's been 16 days or so since the last update), but I didn't know where else to post these questions. (And don't mind the overuse of smileys :D )
Post 07 Jan 2006, 17:39
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
The_Grey_Beast: All your info about logic gates etc. is correct, but unfortunately it has no bearing on which instructions are optimal and which are not.

The ALU (in the CPU) can (and does) perform all operations in the same time, i.e. AND, OR, TEST, XOR, ADD, SUB etc. all take the same time to execute in the unit.

The choice between certain instructions has only become relevant because of the superscalar design of modern processors. With older processors (think Z80, 6502, 8086 etc.) such considerations are meaningless. But the modern P4 has other, more dominating factors that make the choice of instructions important.

Some of the more important factors include cache sizes, resource allocation and dependency chains. The P4 has special hardware support to allow XOR to break dependency chains. TEST does not allocate a writeback resource, so other instructions can make use of the extra resources in parallel. MOV reg,imm uses more space in the caches and can increase accesses to the slower main RAM.
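
To put numbers on the cache-space point, here is a fasm sketch listing the usual register-clearing forms with their 32-bit encodings. It shows code size only and makes no claim about speed on any particular CPU.

```asm
use32                           ; 32-bit code, flat binary output

        xor     eax, eax        ; 31 C0          - 2 bytes, modifies flags
        sub     eax, eax        ; 29 C0          - 2 bytes, modifies flags
        and     eax, 0          ; 83 E0 00       - 3 bytes, modifies flags
        mov     eax, 0          ; B8 00 00 00 00 - 5 bytes, leaves the flags untouched
        ret
```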
Quote:
can someone pls tell me how can I test it properly (at clock cycle level)?
It is very difficult to measure the cycle time of an individual instruction, because there are so many other factors that have a bearing on the result of the test. MOV, TEST and XOR all take the same time to compute in the ALU, but the things I mentioned above (cache, dependency chains, writebacks) will affect your timings too much to make an individual reading valid.

You need to test your code as a whole and see how it all interacts. Singling out one instruction and trying to time it is very difficult indeed. It can be done, but even then it won't help you much if the long latency of one instruction can be hidden by other instructions executing in parallel. Sometimes IMUL can be executed in parallel with other instructions, which can make IMUL the better choice for speed. It all depends on what is to be achieved as a whole.
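
As a rough starting point for timing a block of code as a whole, here is a minimal fasm sketch built on RDTSC. The label names, the iteration count and the instruction under test are placeholders picked for illustration, and the result still includes loop overhead and whatever else the machine was doing, so it is best to run it several times and keep the smallest reading.

```asm
use32                           ; 32-bit code, flat binary output

time_block:                     ; hypothetical routine name
        xor     eax, eax
        cpuid                   ; serialise: earlier instructions finish first
        rdtsc                   ; edx:eax = time-stamp counter
        mov     esi, eax        ; keep the start value in edi:esi
        mov     edi, edx

        mov     ecx, 1000000    ; repeat the code under test many times
.again:
        test    eax, eax        ; <- placeholder for the code under test
        dec     ecx
        jnz     .again

        xor     eax, eax
        cpuid                   ; serialise again (clobbers eax, ebx, ecx, edx)
        rdtsc
        sub     eax, esi
        sbb     edx, edi        ; edx:eax = elapsed ticks, loop overhead included
        ret
```

Note that cpuid overwrites ebx, so a caller that relies on ebx being preserved would need to save it around this routine.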

Some simple general "rules" do apply: Always use XOR reg1,reg1 to clear a register if you don't need to preserve the FLAGS. Always use TEST reg1,reg1 over OR or AND. If you have a choice of instructions to do the same task, use the one that encodes to fewer bytes. Try to avoid writing AH, BH, CH, DH; the processor is not optimised for using these.
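
As an illustration of the last rule, here is a fasm sketch of clearing bits 8..15 of eax with and without writing AH directly. Whether the full-register form is actually faster depends on the processor; the sketch only shows that the alternative exists and what it costs in bytes.

```asm
use32                               ; 32-bit code, flat binary output

        xor     ah, ah              ; 30 E4          - writes AH directly
        mov     ah, 0               ; B4 00          - writes AH directly
        and     eax, 0FFFF00FFh     ; 25 FF 00 FF FF - same effect on bits 8..15, operates on all of eax
        ret
```

The last form trades three extra bytes for never touching AH as a separate register, so the two rules above pull in opposite directions here.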
Post 08 Jan 2006, 03:47
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Thanks for clearing that up, revolution. Yeah, I know about caches, pipelining and parallel instructions, and that it depends on where they are used. Better to select the smallest possible instruction for the task. Even if it's not the best-performing one (in the given context, of course), at least it's the smallest, and there's no doubt about memory.

Memory can be measured easily, but unfortunately speed is difficult to measure and also varies between processors, so...

revolution wrote:
Try to avoid writing AH, BH, CH, DH; the processor is not optimised for using these.

I find that a little disappointing, because the x86 architecture is one of the few that has these special registers. It's a bit of a shame the processor is not optimized for them. :(

Thanks for clearing that up. ;)

Regards,
GreyBeast
Post 08 Jan 2006, 16:25
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
tom tobias wrote:
With regard to the two headed question:
XOR eax,eax, versus MOV eax, zero
this topic has dual aspects:
first: validity/utility of employing Boolean operators to perform a clearing operation,
second, time and/or memory savings with one or another version of the several methods to clear a register:
Scanning the results historically, on the FASM forum:
From March 2004:
http://board.flatassembler.net/topic.php?t=1207
and then, from February 2005:
http://board.flatassembler.net/topic.php?t=3055
more heat, though not necessarily more light in May 2005:
http://board.flatassembler.net/topic.php?t=3516
followed by this provocative dialogue in July 2005:
http://board.flatassembler.net/topic.php?t=3864
We finally arrive at the thread described in the sticky from December 2005:
http://board.flatassembler.net/topic.php?t=4485
regards, :)
Post 02 Mar 2006, 08:47
Matrix



Joined: 04 Sep 2004
Posts: 1171
Location: Overflow
Hehe, I see everybody who programs in assembly likes to overcomplicate the code instead of writing more code.

This is because we want to make something 100% perfect, exact, fast,
though this might not be necessary; you can use any of the register-zeroing codes mentioned.

You do not need to zero out registers often...

Yep, this xor reg, reg has become a habit, but it's OK I think: small, working.

BTW: I have used and mem,0 too when I had to gain some bytes there,
zeroing 4 bytes in one instruction, instead of mov mem,0.
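
For the memory case, here is a fasm sketch of the two forms with their encodings in 32-bit code using an absolute address. The variable name my_var is made up for the example.

```asm
use32                               ; 32-bit code, flat binary output

        and     dword [my_var], 0   ; 83 25 <addr32> 00          - 7 bytes, read-modify-write
        mov     dword [my_var], 0   ; C7 05 <addr32> 00 00 00 00 - 10 bytes, write only
        ret

my_var  dd 12345678h                ; hypothetical variable, zero after either instruction
```

The and form is three bytes shorter because its immediate is a sign-extended byte, but it has to read the memory before writing it.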
Post 02 Mar 2006, 21:50
RedGhost



Joined: 18 May 2005
Posts: 443
Location: BC, Canada
Matrix wrote:
Hehe, I see everybody who programs in assembly likes to overcomplicate the code instead of writing more code.

This is because we want to make something 100% perfect, exact, fast,
though this might not be necessary; you can use any of the register-zeroing codes mentioned.

You do not need to zero out registers often...

Yep, this xor reg, reg has become a habit, but it's OK I think: small, working.

BTW: I have used and mem,0 too when I had to gain some bytes there,
zeroing 4 bytes in one instruction, instead of mov mem,0.

You are correct: the small size/timing difference is such a minute piece of the whole application that it doesn't make much of a difference, and you will probably never see it at run time. Maybe it's just me or my coding style, who knows, but I find I have to zero registers quite often.

_________________
redghost.ca
Post 04 Mar 2006, 09:26
CodeX



Joined: 08 Feb 2006
Posts: 20
Location: Estonia
Come on, guys. This topic is already falling into the "holy war" category.
If you're all using FASM, you should know that it has excellent macro capabilities. Why not define a macro called "CLR" or "CLEAR" (or anything else readable and understandable) which will clear its arguments the way you think is best? Some like XOR-ing arguments, others prefer MOV eax, 0, or MOV eax, ebp (or any other instruction). All you have to do is place your favorite instruction into this macro. Doing so, you'll have consistent code shared between different programmers, and also good readability. You'll also have an easy way to balance your program's speed/size optimization under different conditions just by changing the macro's content. Some processors perform some instructions faster than others, and, as somebody noted, there's also a difference between aligned and unaligned code, different loops and so on.
The same could be done for zero testing and other useful tricks.
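
A minimal sketch of that idea in fasm 1 macro syntax. The macro names clr and tstz are made up for the example, and the bodies simply use XOR and TEST; anyone preferring MOV reg,0 or OR reg,reg would change only the macro bodies.

```asm
use32                           ; 32-bit code, flat binary output

; clr: hypothetical macro name, clears one or more registers
macro clr [reg]
{
  forward
        xor     reg, reg
}

; tstz: hypothetical macro name, sets ZF if the register is zero
macro tstz reg
{
        test    reg, reg
}

        clr     eax, ebx, ecx   ; expands to three XOR instructions
        tstz    edx
        jz      done
        clr     edx
done:
        ret
```

Because the square brackets make reg a group argument, the body is repeated once per register passed to clr.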
By the way, what do you think: would it be a good idea to create a forum called "Best practices" where we can all share information about these kinds of code improvements?
Post 06 Mar 2006, 10:01
edfed



Joined: 20 Feb 2006
Posts: 4238
Location: 2018
the shr or shl 32 is a good idea because it's the shortest instruction
but i always use mov r,0 because it's explicit and i can modify the init value as i want with mov r,m
but the fastest is the xor, as they say in the Intel programmer's reference manual
i don't know how it is on AMD, but for Intel it's like this
Post 23 Apr 2006, 16:56
Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.
