flat assembler
Message board for the users of flat assembler.

Index > Main > XOR EAX,EAX

tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias 24 Jul 2007, 12:16
LocoDelAssembly wrote:
...I think you risk to loose easily that way Tom ...
Lose what? We enter this debate on the FASM forum, not to win or lose, but to learn. I am not going to be disappointed, or angry, or unhappy, or hurt, to learn that MOV is SLOWER, a LOT slower, than XOR. Right now, I am unsure whether MOV requires more time than XOR, or alternatively (if vid's results are reproduced with a better testing method), is about equally fast in clearing a register. I am very concerned about two points. First, I provided the quote from Intel, above, to signal a need to run a MEANINGFUL benchmark, prior to drawing any conclusions (and I don't think the 64 bit test above is meaningful). Secondly, it is not the execution times, but the software development times (prolonged due to CODING practice, rather than PROGRAMMING practice), which are of primary concern to me.
LocoDelAssembly wrote:
The prefeching must be really good to keep the processor feeded completely all the time.
Oh. You prefer to develop a benchmark test for PREFETCH??? I am disinterested in that test. My suggestion, perhaps not well explained, was to develop a benchmark test ONLY for clearing a register, in which case, "Prefetch" goes out the window. We need to develop a test ONLY for clearing the register, a test which is unaffected by the prefetch algorithm. The issue is not "How fast will a cpu clear a register IF PREFETCH is active, versus inactive". The issue is whether or not XOR executes faster than MOV, or perhaps, how much time is saved by employing XOR, rather than MOV, or, more accurately, in terms of the central thrust of this thread: HOW MUCH SLOWER does MOV clear a register, compared with XOR? What is the penalty one must pay, in order to write coherent, easily read programs, using an instruction which has no invisible side effects, as XOR does, by affecting flags, concurrently with the primary clearing operation?
LocoDelAssembly wrote:

Can you provide the code that must be repeated across all the available RAM
No. Why should the instruction be dependent upon the location in memory (so long as the values are in main memory, NOT CACHE)? My idea, which may be incorrect, that's why we have a forum, is this: To thwart the cache/pipeline/prefetch features of the cpu, say, for example, with ten million clearing iterations, the program would place in memory the ten million clearing instructions with successive replacement of the random integer in each register, to be completed successively, WITHOUT ANY LOOPS (ten million clearing instructions X 4 registers X 2, because one must also write a new random number each time, X 4 bytes per location = 320 megabytes of RAM). At the end of the ten million clearing/copying new integer operations, the computer loudspeaker would squeak, or a signal would appear on the screen, so that the operator could note the time. My suggestion is for a small sized program that could write this bulky test program into memory, signal the operator to note the time, wait for the operator's confirmation, disable interrupts, commence the clearing activity (jump to the first location), reenable interrupts and signal program completion, so the operator could note the time. I believe, but may also be in error on this point as well, that the real time clock on the motherboard will no longer serve as an accurate indicator of the elapsed time, if all interrupts have been disabled for a significant amount of time:
Small program (with loops, etc)-->
generate 10,000,000 (or whatever size is needed to obtain a total execution time of about 10-30 seconds) clearing operations on all four registers, writing a NEW random integer into each register after it has been cleared. This integer value must be unpredictable, for each register, else the cpu will fetch the value from cache (I guess that is why the times are so similar on r22's test on vid's machine...) To test the time needed to clear, alone, one must also run the same test, without clearing, simply writing the random integers (always different values) into each of the four registers. Of course, there are two different versions (or n different versions, to test other methods of clearing a register), one which uses MOV, the other XOR.
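As a rough check on the memory estimate above, here is a minimal Python sketch. It assumes the standard 32-bit encodings for registers such as EAX..EDX (2 bytes for xor reg,reg and 5 bytes for mov reg,imm32) and assumes each random integer is loaded as an immediate. The function and constant names are illustrative and not from the thread.

```python
# Encoded lengths in bytes for 32-bit registers such as EAX..EDX (no prefixes).
XOR_REG_REG_LEN = 2    # e.g. xor eax, eax   -> 31 C0
MOV_REG_IMM32_LEN = 5  # e.g. mov eax, imm32 -> B8 followed by 4 immediate bytes

ITERATIONS = 10_000_000  # figure proposed in the post
REGISTERS = 4            # four registers, as in the post


def unrolled_code_size(clear_len, reload_len=MOV_REG_IMM32_LEN):
    """Bytes of loop-free code: every register is cleared, then reloaded."""
    return ITERATIONS * REGISTERS * (clear_len + reload_len)


# The post's own estimate: 2 instructions per register, 4 bytes each.
flat_estimate = ITERATIONS * REGISTERS * 2 * 4
xor_variant = unrolled_code_size(XOR_REG_REG_LEN)
mov_variant = unrolled_code_size(MOV_REG_IMM32_LEN)

for label, size in [
    ("flat estimate", flat_estimate),
    ("XOR variant", xor_variant),
    ("MOV variant", mov_variant),
]:
    print(f"{label}: {size / 1_000_000:.0f} MB")
```

Under these assumptions the flat 4-bytes-per-instruction figure lands between the two variants, so the MOV version would need somewhat more RAM than the estimate and the XOR version somewhat less.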
LocoDelAssembly wrote:
About execution environment, I think we could use some of the OSes developed by fasm members and hack and strip them to our needs (apart of starting our own boot code).

EXCELLENT suggestion. :)
Borsuc



Joined: 29 Dec 2005
Posts: 2465
Location: Bucharest, Romania
Borsuc 24 Jul 2007, 12:38
Now having come back to this forum and seeing THIS thread still going? sorry for offtopic...

Tom, I understand the "readability" you seek, and its importance for you. It is not something hard to understand.

What really bugs me about you, however, is that you keep saying "assembly can be as 'readable' as Pascal".

Of course it can, but why is there both Pascal & assembly? There are different tools for different ideologies and different programmers. You want readability, Pascal is your thing (or any other language that suits this).

Why shouldn't assembly be used for readability (for human logic by the way)? Because by definition, it is the computer's logic -- and readable for computers. We, as humans, are different and limited.

I still don't see a point in using assembly AT ALL if you want human-readability.


Look at this example:

-> You want readability --> assembly is slow --> A HLL is more readable and is faster because you're lazy enough, but at least the compiler will optimize for you.

-> You want small/fast/control --> assembly is perfect for this --> A HLL abstracts information and code for you.

The point is, there is absolutely no point in using assembly for readability. Not all programs are designed to be read by people like you, because not all people are both biologists, architects, astronauts, computer programmers, artists, etc.. etc..

It's like the saying in art: "You either have it or you don't, baby" (no pun intended).


the last phrase probably took me off a bit, eh it's been awhile since I was last on these forums :)
r22



Joined: 27 Dec 2004
Posts: 805
r22 24 Jul 2007, 13:07
re: Vid
Thanks, so I guess on the mobile Core2's there's no XOR vs MOV benefit.

I ran the benchmark on an AMD X2 3800+.

Would be great if people with other 64bit processors ran the benchmark code and shared their results. Then we'd be able to tell if there's an overall speed benefit to using XOR over MOV for reg clearing or if it's only applicable to the older architectures.
0.1



Joined: 24 Jul 2007
Posts: 474
Location: India
0.1 24 Jul 2007, 13:51
Much ado about ...
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 24 Jul 2007, 15:32
vid: Perhaps it is the power saving that clocks down your processor? Try changing the code to make it call TestMov first and TestXor second.

Tom: About prefetching, if instructions are longer the prefetching capabilities of the processor are pushed to the limit and hence you risk not keeping all the execution units of the processor busy. Although the benchmark doesn't care about this, the MOV version could still run slower because of it. As for using all memory: replicating the code 10 million times produces a big program that will probably use all the physical memory on some systems. The loop is there to ensure that the requested number of iterations can be executed independently of the amount of available RAM (the user would be warned about how many times the loop was unrolled). And yes, the RTC still works (although it doesn't have millisecond precision); what doesn't work is the clock maintained by the PIT's interrupt handler.

Anyway, if you will not provide the test code then we have nothing to test.
r22



Joined: 27 Dec 2004
Posts: 805
r22 25 Jul 2007, 00:36
Results on my Core2 E6300 were the same as vid's mobile Core2.
-I also tried running the MOV test before the XOR test and saw that the benchmark had a ~1.5% error.

So it appears that the optimization is not valid on the Core2, but useful on an AMD 64 (the new K10 architecture may remove the XOR benefit).

It's always more interesting when rants and drivel are backed up with code.
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo 25 Jul 2007, 06:29
I can hear it now:

Quote:

AMD64 faster than Intel Core 2!


And then Intel goes out of business. :lol:
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight 25 Jul 2007, 08:46
AMD does seem to make really nice processors. They run a bit hot though.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 25 Jul 2007, 13:56
Feed :D
XOV VS MOV 64-BIT Benchmark Started... (time in processor ticks)
Note: Req32 is used because the upper half of the 64bit register is cleared
Function1 time (xor r32,r32): 0x108565374
Function2 time (mov r32,0x0): 0x105023ED4
Percentage speed difference : -1.275221%

T7200 (mobile if you didn't know) and Server 2003 Enterprise x64 Edition not that it matters.

============AND============

XOV VS MOV 64-BIT Benchmark Started... (time in processor ticks)
Note: Req32 is used because the upper half of the 64bit register is cleared
Function1 time (xor r32,r32): 0x10A484DDC
Function2 time (mov r32,0x0): 0x10668FB58
Percentage speed difference : -1.475688%

E6700 (the last of FSB1066 series) and Server 2003 Enterprise x64 Edition not that it matters.

Both tests ran in a relatively clean environment. Only desktop showing with no taskbar mini-icons.
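The percentages printed above appear to follow from the two tick counts. The Python sketch below assumes the formula is (mov - xor) / mov * 100, which is inferred from the posted numbers and not taken from the benchmark source. The function name is hypothetical.

```python
def speed_difference_percent(xor_ticks, mov_ticks):
    """Relative difference in percent, using the MOV time as the baseline.

    Assumed formula, inferred from the posted output rather than read
    from the benchmark program itself.
    """
    return (mov_ticks - xor_ticks) / mov_ticks * 100.0


# Tick counts copied from the two runs posted above (xor, mov).
runs = {
    "T7200": (0x108565374, 0x105023ED4),
    "E6700": (0x10A484DDC, 0x10668FB58),
}

for cpu, (xor_ticks, mov_ticks) in runs.items():
    print(f"{cpu}: {speed_difference_percent(xor_ticks, mov_ticks):.6f}%")
```

With this sign convention a negative value means the MOV function finished in fewer ticks than the XOR function, which is what both runs show.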

EDIT: Btw, try to put 6 NOPs in front of XOR loop like this:
Code:
times 6 nop
rdtsc
mov r14d,eax
mov r15d,edx
jmp xorlp
    

and you will get EQUAL results. NOTE!!! This is extremely dependent on OS and CPU architecture, and also on the moment at which the thread is given the go-signal. This is a problem of 16-byte code cache alignment, NOT data alignment. XOR takes exactly 16 bytes, but MOV takes 22, so MOV cannot fit in one 16-byte block and must take two, while XOR may or may not fit inside one.
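The alignment argument can be illustrated with a small Python sketch. It assumes a simplified model in which code is fetched in aligned 16-byte blocks, and it uses the 16-byte and 22-byte sizes quoted above. Real front ends differ by microarchitecture, and the function name is hypothetical.

```python
BLOCK_SIZE = 16  # assumed size of an aligned code fetch block, in bytes


def blocks_spanned(start_offset, length, block_size=BLOCK_SIZE):
    """Number of aligned blocks touched by code at start_offset of given length."""
    first_block = start_offset // block_size
    last_block = (start_offset + length - 1) // block_size
    return last_block - first_block + 1


# Try every possible starting offset within a block.
for start in range(BLOCK_SIZE):
    print(
        f"start {start:2d}: "
        f"16-byte code spans {blocks_spanned(start, 16)} block(s), "
        f"22-byte code spans {blocks_spanned(start, 22)} block(s)"
    )
```

In this model the 16-byte sequence fits in a single block only when it starts exactly on a 16-byte boundary, while the 22-byte sequence never touches fewer than two blocks, so a few bytes of padding in front of either loop can change the comparison.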
EDIT2: put add edx,eax instead of add rdx,rax and XOR beats MOV, and changing every line of that code to use 32-bit registers makes no difference: XOR still beats MOV. But that is about as fair as code that runs fast only on Intel or only on AMD :P or ATI, nVidia. These fights have been held before and will be held in the future :D
levicki



Joined: 29 Jul 2007
Posts: 26
Location: Belgrade, Serbia
levicki 29 Jul 2007, 19:06
Hello everyone, I just had to register when I saw this discussion going on and on.

I am a developer whose main focus is code optimization. I write mainly in C/C++ and I also write in assembler, using SIMD extensively, and I am fluent in SSE, SSE2, SSE3, SSSE3 and SSE4.1.

Tom, I really do not understand how you can keep arguing about this.

You even cite book examples of assembler code which uses MOV eax, 0 as some sort of proof that MOV reg, 0 is preferred over XOR reg, reg, which is complete nonsense because anyone can find counter-examples to "prove" you wrong.

Today compilers are much more advanced than they were just a few years ago. They analyze complex data flow in a program in ways a human being can match only with tremendous effort, and they use all known micro-architectural shortcuts in order to make code execute as fast as possible. In other words, they are not pragmatic, but opportunistic.

Nowadays, there is no sense in using assembler for large portions of code where readability might be important. It is usually used sparingly in situations where the compiler cannot optimize your high-level code to your satisfaction.

So, the main reason for using assembler is to squeeze extra performance from the underlying micro-architecture in order to meet the performance level you have been asked to provide. That means readability is no longer the most important issue -- performance is.

You may argue that MOV reg, 0 is not much slower than XOR reg, reg, but when it comes to performance you cannot observe this independently of other code, because various penalties accumulate and you get what is called the "death by a thousand papercuts" effect.

You may argue that the purpose of XOR reg, reg is not obvious. You have already been told to make a macro CLEAR(reg) or ZERO(reg) which internally does XOR reg, reg and you refused to consider it. From your stubbornness on the subject, it is obvious that you are a keyboard warrior and an adamant brute who insists on imposing their own, often flawed, views on others.

In this case, your views and knowledge are old school but not in a positive way at all. Your evolution as an assembler programmer has obviously stopped at 386 code level which was many, many years ago.

Mind you, your views are not flawed because you are stupid which you clearly aren't, but because you are ignorant.

While you were happily ignoring progress, CPU micro-architectures have evolved several times. Modern CPUs have mind bogglingly complex OOO (Out-Of-Order) execution engines. Careless use of instructions and registers creates complex dependency chains and penalties on those modern CPUs, not to mention reduction of overall code and data throughput.

Did you know that when a modern CPU sees an XOR reg, reg instruction, it automatically knows that code which uses that register following the XOR instruction does not depend on the code using the same register before it? That is a hint which you cannot pass to the CPU by using the MOV instruction, so XOR is often used to break dependency chains in the code.

So Tom, I suggest you start your "Back to the future" trip here:
http://agner.org/optimize/

If you are by any chance programming for modern CPUs made by Intel, I would strongly advise you to read the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Relevant documentation for AMD CPUs also exists and I am sure you know how to use Google to find it, but most of the tricks explained in Intel's and Agner's manuals can be used on all modern CPUs, and the only thing one has to consider is the level of SIMD support on the particular CPU which the code is targeting.

After reading all those manuals, you will hopefully be able to accept the fact that both MOV reg, 0 and XOR reg, reg have their valid place under the Sun.

If that doesn't help, and if after all this time you still haven't got yourself used to the XOR reg, reg, maybe it is time to retire and leave the real work to those who care about saving precious bytes and CPU cycles?

I strongly urge someone in power to sticky and close this thread because further discussion is pointless.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 22 Dec 2007, 22:37
Discussion about Intel C/C++ compiler continues here
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 23 Dec 2007, 06:24
http://www.nytimes.com/2007/12/23/us/technology/23newinstruction.html?_r=1&ref=technology&oref=slogin

Santa Clara, CA - Today Intel announced an addition to the IA32 and IA32-64 instruction sets. For many years developers, coders and forum participants have been caught in an unending battle about standards, efficiency and readability of the instruction choices used in programs. To help stem the tide of vitriolic and sometimes abusive postings from all over the world, Intel have introduced a specialised instruction designed to ease the path of the affected programmers.

Paul Otellini, Intel CEO, said that "This new instruction will allow all developers to move forward in peace and harmony for ever and ever. The time has come to settle these long and deep arguments once and for all". The developer community is hailing it as a superb upgrade to the 29 year old instruction set. Industry analysts are picking it to be the "Best thing since sliced bread".

Since its introduction, the IA32 instruction set has lacked a clear and concise way to do certain things. Some developers declared that readability was paramount, while others said efficiency in size and execution time was vital to any modern program and eclipsed any desire for other things.

With this new instruction Intel has raised the bar for other chip makers to follow. AMD has yet to officially comment but the inside word is that they will eventually follow like lambs and bow to the superiority of its bigger brother but they may use a different mnemonic just so that they don't look like they are just followers.

Technical details of this new instruction are still not confirmed as the instruction is still under wraps. Experts in the field expect it to be unveiled on Dec-25, allowing Intel to give a Christmas present of goodwill to the world. An unnamed source has leaked the details to us and we can now share the joy with everyone.

An unnamed source wrote:
Code:
Opcode              Mnemonic
------              --------
30 /r/r             ZERO r8
31 /r/r             ZERO r16
31 /r/r             ZERO r32
REX.W 31 /r/r       ZERO r64


Description:

Performs a zeroing operation on the destination operand. The destination operand must be a register and is repeated twice in the encoding /r/r.

In 64-bit mode, using a REX prefix in the form of REX.R permits access to additional registers (R8-R15). Using a REX prefix in the form of REX.W promotes operation to 64 bits.


Operation:

DEST <-- 0;


Flags Affected:

The OF, SF and CF flags are cleared; the ZF and PF flags are set. The state of the AF flag is undefined.


Protected Mode Exceptions:

#UD If the LOCK prefix is used.


Real-Address Mode Exceptions:

#UD If the LOCK prefix is used.


Virtual-8086 Mode Exceptions:

#UD If the LOCK prefix is used.


Compatibility Mode Exceptions:

Same exceptions as in protected mode.


64-Bit Mode Exceptions:

#UD If the LOCK prefix is used.
It has been suggested that AMD will use the mnemonic CLEAR instead of ZERO.

Intel also announced that the new instruction will retroactively be made valid on all of its previous IA32 architecture chips by using a previously unannounced secret remote access backdoor to update all processors from 8086 and up without needing to remove them from the system.
tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias 23 Dec 2007, 13:03
revolution wrote:
Actually I meant to lock the XOR EAX,EAX topic. This split thread [referring to the C/C++ nonsense over in HEAP] still [has] some life in it yet, but limited I fear.
hmm. Seems that revolution, kindly of course, feeling the christmas cheer, decided to liven up this "moribund" thread, concerning XOR eax, eax, by offering a hoax. Well, no matter. Life goes on. I confess to having been obliged to look twice at my calendar, to make sure that it was 23 december, not 01 April.
I remain convinced, more than ever, that readability trumps execution speed, and that software development time is far more critical than application throughput. Designing and writing good programs, as opposed to sloppily written code, is not trivial, and is independent of the language used. For those (many) who believe that homo sapiens is so sapiens that he (or she, for 0.1) arrives on planet earth with an innate ability to distinguish XOR eax, ecx from a page full of XOR eax, eax, here's a genuine link to a real article, which lays to rest the myth of homo's self-anointed brilliance:
http://news.nationalgeographic.com/news/2007/12/071203-AP-chimp-memory.html
Thank you very much Loco, for splitting the thread, and for (again) requesting the benchmark code, which I have not forgotten about. This thread is moribund only in the minds of those who believe they are so intelligent that they need not (or should not!) write clearly, but can prosper with muddled garfle mixed up with obfuscated nonsense. Why would we want to "lock" any thread on the FASM forum? This issue (readability versus execution speed) is at least as important today, in the era of multicore cpu's as it ever was in bygone days of single core computing.
:)
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 23 Dec 2007, 13:18
This thread: :D
Code:
MerryGoRound:
  Optimise for size
  Optimise for speed
  Optimise for readability
  jmp MerryGoRound    
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
FrozenKnight 29 Dec 2007, 11:40
tom tobias wrote:
Thank you very much Loco, for splitting the thread, and for (again) requesting the benchmark code, which I have not forgotten about. This thread is moribund only in the minds of those who believe they are so intelligent that they need not (or should not!) write clearly, but can prosper with muddled garfle mixed up with obfuscated nonsense. Why would we want to "lock" any thread on the FASM forum? This issue (readability versus execution speed) is at least as important today, in the era of multicore cpu's as it ever was in bygone days of single core computing.
:)


tom tobias, it's not that we feel that we are smarter. We just feel that if readability is to be primary when programming in ASM, then what is the advantage of using ASM over other more readable languages like Basic, C or Python?

But since you think you need to have readable ASM programs, I'll let you waste your time trying to make probably one of the more unreadable languages readable. IDK what you do with your time; my only use for ASM is to trick the CPU into performing beyond what a compiler can output. Other than that I'll use a compiler.
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak 29 Dec 2007, 18:12
revolution wrote:
This thread: :D
Code:
MerryGoRound:
  Optimise for size
  Optimise for speed
  Optimise for readability
  jmp MerryGoRound    


Readability is BS as far as I'm concerned. Comments fix all in that regard, and if you can't read the code you shouldn't be programming in that programming language. ASM is the best for readability in that regard, since there are no 1 liners.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 30 Dec 2007, 01:52
kohlrak wrote:
ASM is the best for readability in that regard, since there are no 1 liners
Haha, yes there are.
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak 30 Dec 2007, 05:05
Well... There aren't any one-liners that do more than making a raw text file...
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 30 Dec 2007, 05:07
kohlrak wrote:
Well... There aren't any one-liners that do more than making a raw text file...
With irp any program can be written in one line. No need to restrict oneself to just display directives you know.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20299
Location: In your JS exploiting you and your system
revolution 30 Dec 2007, 05:11
e.g.
Code:
irp x,include 'win32ax.inc',.code,start:,<invoke MessageBox,HWND_DESKTOP,"Hi! I'm the example program!","Win32 Assembly",MB_OK>,<invoke ExitProcess,0>,.end start{x}    

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.