flat assembler
Message board for the users of flat assembler.
"mov eax,0" or "xor eax,eax"
RedGhost 15 Dec 2005, 11:10
decard wrote: Me too. Xoring is something natural for me now, just like using "or" for testing if some register is zeroed. In Fresh source, John Found was always using "test" for that purpose. I just wonder if there's any speed difference between "or" and "test"?
Yeah, I have also come to use 'or' before jnz/jz to test for zero, and I now use that (or other flag-setting tricks) more than I use cmp.
_________________
redghost.ca
revolution 15 Dec 2005, 12:53
TEST is preferred over OR/AND because OR/AND allocate an internal resource in the CPU for writeback to the register/memory as well as writeback to the flags. TEST only reads the register/memory and writes to the flags.
With a superscalar architecture this is an important distinction and helps improve speed.
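To make the comparison concrete, here is a minimal sketch of the two zero checks side by side. It assumes fasm 1 syntax and an x86-64 Linux target; the exit wrapper and the label names are only there so the snippet assembles and runs on its own, and are not from the thread.

```asm
format ELF64 executable 3
entry start

segment readable executable

start:
        xor     eax, eax        ; eax = 0 for the demonstration

        or      eax, eax        ; sets ZF, but also writes eax back (same value)
        jnz     nonzero

        test    eax, eax        ; sets ZF the same way, writes FLAGS only
        jnz     nonzero

        xor     edi, edi        ; exit status 0: both checks saw zero
        jmp     done
nonzero:
        mov     edi, 1          ; exit status 1: a check saw a nonzero value
done:
        mov     eax, 60         ; sys_exit on x86-64 Linux
        syscall
```

Both instructions leave the same ZF for the branch; the only difference is that OR has a register destination and TEST does not, which is the writeback distinction described above.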
rugxulo 15 Dec 2005, 19:02
I was messing with DJGPP (GCC 3.4.3) yesterday, and when I used -S -masm=intel, I noticed a lot of TEST eax,eax and XOR ebx,ebx.
I used to always use OR (compare to zero) and XOR (clear) but switched to TEST and SUB recently just out of curiosity!
[EDIT] Okay, I upgraded to GCC 3.4.4 then (reluctantly) to GCC 4.0.1 after being frustrated with trying to compile stupid Allegro (4.20 didn't work, but I think I got 4.03 compiled with GCC 3.2.3). I must say, newer versions of GCC do seem to break a lot of stuff (almost as many flaws as improvements). [/EDIT]
El Tangas 17 Dec 2005, 19:49
I actually had never noticed that the count in shift instructions was mod 32. Guess shl/shr reg,32 (64, 96, etc.) are just expensive nops...
How does it work in 64-bit mode, btw?
MazeGen 17 Dec 2005, 19:59
El Tangas wrote: How does it work in 64 bit mode, btw?
For a 64-bit operand, it is mod 64.
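The masking described above can be seen in a short program. This is a sketch assuming fasm 1 syntax and x86-64 Linux; the register choices and the exit wrapper are illustrative only.

```asm
format ELF64 executable 3
entry start

segment readable executable

start:
        mov     cl, 33          ; a shift count larger than 31

        mov     eax, 1
        shl     eax, cl         ; 32-bit operand: count is 33 and 31 = 1, so eax = 2

        mov     rbx, 1
        shl     rbx, cl         ; 64-bit operand: count is 33 and 63 = 33, so rbx = 1 shl 33

        mov     cl, 32
        mov     edx, 1
        shl     edx, cl         ; count is 32 and 31 = 0: edx stays 1 (the "expensive nop")

        mov     edi, eax        ; report eax (2) as the exit status
        mov     eax, 60         ; sys_exit on x86-64 Linux
        syscall
```

On a Linux shell, running the assembled file and then `echo $?` should show the value left in eax after the first shift.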
revolution 18 Dec 2005, 00:13
Quote: but switched to TEST and SUB recently just out of curiosity
r22 18 Dec 2005, 05:49
>On newer processors TEST is faster than OR by a decent amount.
>In 64-bit mode the shift count is taken modulo the operand size: 32 for a 32-bit register, 64 for a 64-bit register. It's done this way because of the variable shift instruction SHR reg,CL, so if you're using some algorithm to get the shift count you don't have to mod CL beforehand.
>I went from programming in Visual Basic to learning assembly and I had no problem with XOR reg,reg meaning clear reg. I don't think we give noobs enough credit in that respect.
I like the xor edi,edx argument about it being hard to find a bug, but when you think logically it's a minority situation, and therefore a logical fallacy to use it as justification. It would be like saying, "Smoking 12 cigars a day is healthy, because my uncle's cousin did it and he lived 80 years," even though there's overwhelming evidence that smoking 12 cigars a day is not healthy.
People used to use assembly because it was the only way to optimize code to a speed where it would be usable. Now there are compilers coded by rooms of people with PhDs in math and computer science that can optimize code better than an experienced ASM coder (and if not better, at least in a fraction of the time).
I forgot the point of this so... sorry you read it.
revolution 18 Dec 2005, 09:55
Quote: Now there are compilers coded by rooms of people with PhDs in math and computer science that can optimize code better than an experienced ASM coder
Quote: (and if not better, at least in a fraction of the time).
So many times the HLL compiler just doesn't understand what the desired goal is. You are forced to use the structure and commands of the particular language, which is necessarily an abstraction from the actual problem at hand. You never see a program where the author proudly states "we didn't use any assembly code because we wanted the best possible speed"; it is just not the case. Indeed, the opposite is true: you can easily find people stating that they used assembly code in the critical loop because of speed. Even Microsoft use assembly for most of the very important and heavily used system calls; they keep different source code for each processor that they support.
Sorry to flame on your comment like this, but stuff like this is always being stated by HLL users as fact when they never stop to think about what they are actually saying.
Quote: It would be like saying, "Smoking 12 cigars a day is healthy, because my uncle's cousin did it and he lived 80 years," even though there's overwhelming evidence that smoking 12 cigars a day is not healthy.
r22 18 Dec 2005, 21:12
> rev, use VC++ or GCC to compile a stacked (nested) for loop. With the exception of MMX and SSE instructions, HLL compilers have a solid optimization core for general-purpose asm code. Of course the compilers can only do so much with the object-oriented hoohaa and classes that they have to deal with, but that's the language, not the compiler's ability to optimize.
What I said holds for 32-bit programming; I haven't done enough testing with compilers on 64-bit to make an educated statement.
rugxulo 19 Dec 2005, 22:20
A long time ago, I heard TEST reg,reg was better for older machines (286?) than OR reg,reg.
Besides, doing the following with GNU/DJGPP sed 4.1.4 should quickly change it back:
Code:
sed -i -e "s/[tT][eE][sS][tT] \+\(...\?\),\1/or \1,\1/g" myfile.asm
P.S., r22, if you think a certain compiler can optimize so well, at least tell us which one you're referring to!
Madis731 20 Dec 2005, 08:40
Every compiler can, if you let it (or if it has a set of parameters that you can change). The problem is that where we see data structures and algorithms, the compiler only sees CODE, and depending on the amount it gets fed at once it can make different judgements on that bunch. Related to that, it takes time. We (human beings) can throw away stuff immediately if we can see it's not working, but the compiler brutally tries everything in spite of that.
Borsuc 07 Jan 2006, 17:39
Madis731 wrote: Every compiler can, if you let it (or if it has a set of parameters that you can change). The problem is that where we see data structures and algorithms, the compiler only sees CODE, and depending on the amount it gets fed at once it can make different judgements on that bunch. Related to that, it takes time. We (human beings) can throw away stuff immediately if we can see it's not working, but the compiler brutally tries everything in spite of that.
True. Compilers are mindless; they simply see what WE coded and try to optimize that. All they see is code (or some particular piece of info we tell them via source code). If I coded something for hidden surface removal (like Z-buffering, S-buffering) in 3D graphics, the compiler will just USE my algorithms and TRY to optimize them. But I (or any other human being who understands what my code is supposed to do) can improve that algorithm, because I know what it's supposed to do, I know its purpose, and I know what it will actually do.
The more you abstract something for a compiler (i.e. object-oriented, etc.), the less information it will have about the REAL code it needs to produce, so less optimization. Simply, you can't create something from nothing. After all, that's why we are programmers.
BTW: in the worst case, the asm programmer can simply take a look at the compiler's code and improve it, so he can ALWAYS optimize better than a compiler...
...at least until computers begin to think (AI), and then they will code everything in hexadecimal anyway. But that's probably not happening, because all software companies (like big M$) would be totally useless, since we could easily tell the AI to: build me a Wind0ze with graphics here and there... where's all the fun?
Anyway, back on topic: since this topic is about the fastest way to do a certain thing (at least that's what I understood from it), I have to ask a few questions.
It seems xor reg, reg is the fastest way to clear a reg. What about xor ah, ah? Is it still faster than mov ah, 0, or mov ah, reg8 where reg8 is 0? Probably yes, but mov does not require computations, does it? It just moves bits from one operand into another. If anyone knows, please tell me. Thx.
Is add reg, reg faster than shl reg, 1? Has anyone tested? I'm not experienced in testing at clock cycle level (I always get wrong results). And of course both should be a lot faster than imul reg, 2.
I always thought test reg, reg is faster than or reg, reg because it doesn't write to the register. Guess that's the case.
Dunno if this helps out, but: the logical operators and and or are very simple to implement in a circuit, usually requiring 2 or 3 gates. xor, on the other hand, needs a few more gates, but it is faster than and eax, 0 because it doesn't need an immediate value... add and sub are a bit more complicated in logic circuitry, because they need to test the carry bit to see if it overflowed, etc.; they are not just simple logical operations (which are implemented easily via gates). Because of the carry, see this:
Code:
0 0 1 0 0 1 1 1
Logical operators (and, or, xor) need only operate on individual bits, but add and sub must handle a special case. Here's add:
Code:
0 + 0 = 0
1 + 0 = 1
0 + 1 = 1
1 + 1 = 0 (carry)
That is, they are a bit more complicated because of the carry. Or am I wrong? If it is like I said, why is mov eax, reg32 slower than xor eax, eax? (BTW, reg32 is 0.)
One last thing: can someone pls tell me how I can test it properly (at clock cycle level)? I always run the same test multiple times, and sometimes it tells me one method is faster, and the second time it is exactly reversed. Dunno if I'm doing something wrong.
Sorry to bring this thread back to life (it's been 16 days or so since it was updated), but I didn't know where to post these questions.
(And don't mind the overuse of smileys.)
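The one-bit add table above can be reproduced with the logical instructions themselves, which shows where the carry comes from. This is only an illustrative sketch in fasm 1 syntax for x86-64 Linux; it models a single-bit half adder and says nothing about how a real ALU is built or timed.

```asm
format ELF64 executable 3
entry start

segment readable executable

start:
        mov     eax, 1          ; a (one bit)
        mov     ebx, 1          ; b (one bit)

        mov     ecx, eax
        xor     ecx, ebx        ; sum bit   = a xor b  (0 for 1 + 1)

        mov     edx, eax
        and     edx, ebx        ; carry bit = a and b  (1 for 1 + 1)

        lea     edi, [rcx + rdx*2]  ; recombine as carry*2 + sum, giving 2
        mov     eax, 60         ; sys_exit on x86-64 Linux
        syscall
```

The sum column of the table is exactly xor, and the "(carry)" note on the last row is the and of the two inputs; a wider adder has to pass that carry on to the next bit position.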
revolution 08 Jan 2006, 03:47
The_Grey_Beast: All your info about logic gates etc. is correct but unfortunately has no bearing on which instructions are optimal and which are not.
The ALU (in the CPU) can (and does) perform all of these operations in the same time, i.e. AND, OR, TEST, XOR, ADD, SUB etc. all take the same time to execute in the unit. The preference for certain instructions over others has only become relevant because of the superscalar design of modern processors. With older processors (think Z80, 6502, 8086 etc.) such considerations are meaningless. But the modern P4 has other, more dominating factors that make the choice of instructions important. Some of the more important factors include cache sizes, resource allocation and dependency chains. The P4 has special hardware support to allow XOR to break dependency chains. TEST does not allocate a writeback resource, so other instructions can make use of the extra resources in parallel. MOV reg,imm uses more space in the caches and can increase accesses to the slower main RAM.
Quote: can someone pls tell me how I can test it properly (at clock cycle level)?
You need to test your code as a whole and see how it all interacts together. Singling out one instruction and trying to time it is very difficult indeed. It can be done, but even then it won't help you much if the long latencies of one instruction can be hidden by other instructions executing in parallel. Sometimes IMUL can be executed in parallel with other instructions, making IMUL a better choice for speed. It all depends on what is to be achieved as a whole.
Some simple general "rules" do apply:
Always use XOR reg1,reg1 to clear if you don't need to preserve the FLAGS.
Always use TEST reg1,reg1 over OR or AND.
If you have a choice of instructions to do the same task, use the one that encodes to fewer bytes.
Try to avoid writing AH, BH, CH, DH; the processor is not optimised for using these.
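The first three rules can be sketched in a few lines. This assumes fasm 1 syntax and x86-64 Linux; the labels and register choices are made up for the example, and the byte counts in the comments are the standard encodings for these register forms.

```asm
format ELF64 executable 3
entry start

segment readable executable

start:
        xor     eax, eax        ; 2 bytes: clears eax, but overwrites FLAGS
        mov     ebx, 0          ; 5 bytes: clears ebx, leaves FLAGS untouched

        test    eax, eax        ; sets ZF from eax without writing any register
        jnz     failed          ; not taken: eax is 0

        cmp     ebx, 1          ; FLAGS now hold the result of ebx - 1
        mov     ecx, 0          ; mov keeps those FLAGS; xor ecx, ecx would destroy them
        je      failed          ; still decided by the cmp: 0 <> 1, so not taken

        xor     edi, edi        ; exit status 0
        jmp     done
failed:
        mov     edi, 1          ; exit status 1
done:
        mov     eax, 60         ; sys_exit on x86-64 Linux
        syscall
```

The middle case is the exception built into the first rule: when a later branch still depends on an earlier compare, the longer MOV is the one that keeps the program correct.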
Borsuc 08 Jan 2006, 16:25
Thanks for clearing that up, revolution. Yeah, I know about caches, pipelining and parallel instructions, and that it depends on where they are used. Better to select the smallest possible instruction to do the task. Even if it is not the most 'performant' one (in the given context, of course), at least it's the smallest, and there's no doubt about memory.
Memory can be easily measured, but unfortunately speed is difficult to measure and also varies between processors, so...
revolution wrote: Try to avoid writing AH, BH, CH, DH; the processor is not optimised for using these.
I find it a little disappointing, because the x86 architecture is one of the few that have these special registers. I mean, it's a bit of a shame the processor is not optimized for them.
Thanks for clearing that up. Regards, GreyBeast
vid 02 Mar 2006, 08:47
tom tobias wrote: With regard to the two headed question:
Matrix 02 Mar 2006, 21:50
Hehe, I see everybody programming assembly likes to overcomplicate the code instead of writing more code.
This is because we want to make something 100% perfect, exact, fast, though this might not be necessary. You can use any of the mentioned register-zeroing codes. You do not need to zero out registers often... Yep, this xor reg, reg has become a habit, but it's OK I think: small, working. BTW: I have used and mem,0 too when I had to save some bytes there, zeroing 4 bytes in one instruction instead of using mov mem,0.
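The size trick mentioned above for memory operands looks like this. It is a sketch in fasm 1 syntax for x86-64 Linux; `counter` is a hypothetical variable invented for the example.

```asm
format ELF64 executable 3
entry start

segment readable executable

start:
        and     dword [counter], 0      ; 8-bit immediate form: 3 bytes shorter
        mov     dword [counter], 0      ; needs a full 32-bit immediate

        mov     edi, [counter]          ; 0 either way, used as the exit status
        mov     eax, 60                 ; sys_exit on x86-64 Linux
        syscall

segment readable writeable

counter dd 12345678h                    ; hypothetical variable to be zeroed
```

The saving comes purely from the immediate size. Note that AND has to read the old value before writing, while MOV only writes, so this is a size optimisation and not necessarily a speed one.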
RedGhost 04 Mar 2006, 09:26
Matrix wrote: Hehe, I see everybody programming assembly likes to overcomplicate the code instead of writing more code.
You are correct: the small size/timing difference is such a minute piece of the whole application that it doesn't make much of a difference, and you can probably never see it at run-time. Maybe it's just me or my coding style, who knows, but I find I have to zero registers quite often.
_________________
redghost.ca
CodeX 06 Mar 2006, 10:01
Come on, guys. This topic is already falling into the "holy war" category.
If you're all using FASM, you should know that it has excellent macro capabilities. Why not define a macro command called "CLR" or "CLEAR" (or anything else readable and understandable) which will actually clear its arguments the way you think is best? Some like XOR-ing arguments, others prefer MOV eax, 0, or MOV eax, ebp (or any other command). All you have to do is place your favorite command into this macro. Doing so, you'll have conformity of code shared between different programmers, and also good readability. You'll also have an easy way to balance your program's speed/size optimization under different conditions by just changing the macro content. Some processors perform some commands faster than others, and as somebody noticed there's also a difference between aligned and unaligned code, different loops and so on. The same could be done for zero testing and other useful tricks.
By the way, what do you think: would it be a good idea to create a forum called "Best practices" where we can all share information about these kinds of code improvements?
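A macro along the lines suggested above might look like the following. This is a sketch in fasm 1 macro syntax for x86-64 Linux; the names `clr` and `tstz` are made up for the example and are not standard fasm macros.

```asm
format ELF64 executable 3
entry start

; clr accepts one or more registers; change the body to switch
; strategy everywhere (e.g. to "mov reg, 0" or "sub reg, reg")
macro clr [reg]
{
    forward
        xor     reg, reg
}

; tstz sets ZF when reg is zero
macro tstz reg
{
        test    reg, reg
}

segment readable executable

start:
        mov     eax, 5
        mov     ebx, 7
        clr     eax, ebx        ; expands to xor eax, eax / xor ebx, ebx
        tstz    eax
        jnz     failed

        xor     edi, edi        ; exit status 0: eax was cleared
        jmp     done
failed:
        mov     edi, 1
done:
        mov     eax, 60         ; sys_exit on x86-64 Linux
        syscall
```

One thing to keep in mind with this design: a body based on XOR or SUB changes the FLAGS while one based on MOV does not, so swapping the macro body can change behaviour wherever a later branch relies on earlier flags.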
edfed 23 Apr 2006, 16:56
The shr or shl 32 is a good idea because it's the shortest instruction,
but I always use mov r,0 because it's explicit and I can modify the init value as I want with mov r,m. But the fastest is the xor, as they say in Intel's programmer's reference manual. I don't know how it is on AMD, but for Intel it's like this.
Copyright © 1999-2024, Tomasz Grysztar.