flat assembler
Message board for the users of flat assembler.

Optimizing Education...Please?

tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias 30 May 2005, 11:50
beppe85 wrote:
....

And "xor r32, r32" is clear for me... You got it? "clear" hehehehe

Yes, thanks, it is always appropriate to invoke humor!! And actually, this is PRECISELY the point, isn't it? The English word "clear", which in this context means to RESET a register so that all bits are zero, also implies TRANSPARENT, or EASILY UNDERSTOOD. My point then is this: THERE IS NO NEED to invoke a Boolean operator, like xor, to accomplish a reset. Yes, an incidental CONSEQUENCE of executing this instruction in the peculiar, and unnatural, situation where BOTH operands are the SAME register is to assign a value of zero to all bits, JUST AS IF that had been our intention from the outset. However, good programming practice does not include writing CODE that is ILLOGICAL and based upon accidental, implementation dependent sequelae. It is not only NOT NORMAL, it is NOT LOGICAL, to employ a BOOLEAN OPERATOR to clear a register. If the architects had instead decided (in that extremely rare and unusual circumstance where a BOOLEAN operator is applied to a register and itself, i.e. XOR EAX, EAX) to set all bits to a value of 1 instead of zero, would you then use ANOTHER Boolean operator, NOT, to RESET the consequence of XOR EAX, EAX, so that EAX could assume the value of zero? My point then, in brief, is this: if, as a programmer, you seek to RESET the contents of a register to zero, DO SO: MOV EAX, zero
Simple, readily understood, i.e. TRANSPARENT, not confounded with BOOLEAN logic, and, as beppe85 noted: CLEAR.
r22



Joined: 27 Dec 2004
Posts: 805
r22 31 May 2005, 02:09
If you have no reason to optimize your code for speed, then there is no "logical" point to using assembly, as you could develop much faster using C/C++.

XOR reg32,reg32 is an optimization. If you can't think of it as a synonym for MOV reg32,0, then that's a personal fault of yours, not of the person using it. You can equate this to an educated person with a large vocabulary: when speaking to a child, they would dumb their vocabulary down so that they could be understood (MOV reg32,0), but when speaking with a similarly educated person they would use more complex words (XOR reg32,reg32) to get their point across in the most EFFICIENT way.

Ambiguous ASM code should have comments that more clearly represent the purpose of the instructions. But using slower instructions in your code because of clarity is illogical (if you want clear code, use an HLA with syntax that people can understand).

Expecting machine code to be written so that EVERYONE can more easily understand its purpose undermines the purpose of writing machine code in the first place (speed).
tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias 31 May 2005, 12:00
Thank you, r22, for your thoughtful ideas, much appreciated. I have some comments in response, but these thoughts are directed to the forum as a whole, not to you as criticisms.
First, your very provocative, BUT WIDELY ACCEPTED as valid, statement that one can complete a programming assignment more efficiently with C/C++ (hll) than with Assembly: sorry, I simply do not share this opinion, but I do acknowledge that 95%+ of the world's programmers agree with you.
ASM is much easier to read and understand than hll, in my opinion, when accessing any hardware component.
Second: I do not share your opinion that one is "dumbing down" to employ mov eax,zero, rather than xor eax,eax. However, if someone feels more productive, or more professional, or less amateurish to write xor eax, eax, instead of mov eax,zero, then, fine, no problem. Were such a person employed by me, I would then inquire: WHY does such an individual feel that this conduct represents a sign of superior skill? I would then inquire whether there may be OTHER instructions which perhaps ALSO assign a value of zero to a register. Does use of one of those instructions also convey a sense of superiority, or professionalism? Does such a person acknowledge the FACT that, (mis)used frequently enough simply to clear the contents of a register, xor will be overlooked BY OTHERS reading this person's code when they encounter the instruction in its intended setting: xor eax,ecx?
Third: machine code is not synonymous with Assembly, nor should the two be confounded in a discussion of the merits of Assembly.
Fourth: "But using slower instructions in your code because of clarity is illogical...". Hmm.
How do you know that mov eax,zero is SLOWER than xor eax,eax? Can you offer a benchmark study (which requires assignment of successively DIFFERENT, random values from memory, i.e. not data sitting in cache, prior to clearing the register by the two instructions)? Consider two identical programs, one written in asm in 1965 for the IBM 360, and the other written in hll (i.e. inefficient) in 2005, for one of the widely available desktop CPUs. Which program would execute faster??? Hardware improvements during the past four decades have fundamentally rendered moot the question of speed of execution. Rarely does one reject a program today because it executes too slowly. Therefore, instead of focusing on instruction speed, one ought to attend to the EASE of WRITING and READING a good program, defining good by virtue of facility for modification by another author. :)
r22



Joined: 27 Dec 2004
Posts: 805
r22 31 May 2005, 19:13
EAX is set to a random 32 bit value and then cleared using Xor, this is repeated 0xFFFFFFFF times.
EAX is set to a random 32 bit value and then cleared using Mov, this is repeated 0xFFFFFFFF times.

This simple benchmark shows Xor clearing to be faster than Mov on a 3.2 GHz P4 machine with 1 GB of RAM.
I've also reversed the order of execution (testing the Mov loop first and the Xor loop second) just to make sure; the results remain the same: Xor is faster.

After multiple tests in both orders (Xor loop then Mov loop, AND Mov loop then Xor loop), the results are as follows.
(Smaller numbers mean faster)
OpCode   Range of results (milliseconds)
Xor:     54894 - 55175
Mov:     55835 - 56203
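
For scale, the ranges above can be turned into a per-iteration cost and a relative gap. This is only a back-of-the-envelope sketch, not part of the benchmark: it assumes the P4 really runs at its nominal 3.2 GHz and that each loop executes exactly 0xFFFFFFFF passes. The helper name `cycles_per_iteration` is made up for illustration.

```python
ITERATIONS = 0xFFFFFFFF      # loop count used by both test loops
CLOCK_HZ = 3.2e9             # assumption: nominal clock of the 3.2 GHz P4

def cycles_per_iteration(elapsed_ms, iterations=ITERATIONS, clock_hz=CLOCK_HZ):
    """Convert an elapsed time in milliseconds to clock cycles per loop pass."""
    return elapsed_ms / 1000.0 * clock_hz / iterations

xor_ms = (54894, 55175)      # reported range for the Xor loop
mov_ms = (55835, 56203)      # reported range for the Mov loop

# Smallest and largest gap that is consistent with the two ranges.
gap_ms = (mov_ms[0] - xor_ms[1], mov_ms[1] - xor_ms[0])
# The same gaps as a percentage of the corresponding Xor time.
gap_pct = (100.0 * gap_ms[0] / xor_ms[1], 100.0 * gap_ms[1] / xor_ms[0])

if __name__ == "__main__":
    for name, rng in (("Xor", xor_ms), ("Mov", mov_ms)):
        low, high = (cycles_per_iteration(t) for t in rng)
        print(f"{name}: {low:.1f} - {high:.1f} cycles per iteration")
    print(f"gap: {gap_ms[0]} - {gap_ms[1]} ms "
          f"({gap_pct[0]:.1f}% - {gap_pct[1]:.1f}%)")
```

Under these assumptions each pass costs on the order of 40 cycles, presumably most of it inside Random32, so the clearing instruction is only a small part of what is being timed.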

Code:
format PE GUI 4.0
entry start
include '%fasminc%\win32a.inc'

section '.code' code readable executable
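start:
     call MakeSeed

     ; --- time the XOR loop ---
     mov ebx,0FFFFFFFFh      ; iteration count
     call [GetTickCount]
     mov esi,eax             ; esi = start tick
.lp1:
     call Random32           ; eax = random 32 bit value
     xor eax,eax             ; clear with XOR
     dec ebx
     jnz .lp1
     call [GetTickCount]
     sub eax,esi
     mov esi,eax             ; esi = XOR loop time in ms

     ; --- time the MOV loop ---
     mov ebx,0FFFFFFFFh      ; iteration count
     call [GetTickCount]
     mov edi,eax             ; edi = start tick
.lp2:
     call Random32           ; eax = random 32 bit value
     mov eax,0               ; clear with MOV
     dec ebx
     jnz .lp2
     call [GetTickCount]
     sub eax,edi             ; eax = MOV loop time in ms

     ; --- report both times ---
     push eax                ; Mov time
     push esi                ; Xor time
     push fmt
     push buffer
     call [wsprintf]
     add esp,16              ; wsprintf is cdecl: remove all 4 pushed dwords
     push 4                  ; MB_YESNO (the answer is not checked)
     push buffer             ; caption
     push buffer             ; text
     push 0                  ; no owner window
     call [MessageBox]
  .ending:
     push 0
     call [ExitProcess]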


Random32:
     push ebp
     push ebx
     mov ebp,RandomSeed
     mov eax,[ebp]
     mov ebx,[ebp+4]
     mov ecx,[ebp+8]
     mov edx,[ebp+12]
     shld ebx,eax,1
     adc eax,0
     ror eax,7
     bswap eax
     shld edx,ecx,1
     adc ecx,0
     rol ecx,5
     mov [ebp],eax
     mov [ebp+4],ebx
     mov [ebp+8],ecx
     mov [ebp+12],edx
     xor eax,ecx
     pop ebx
     pop ebp
     ret 0

SetSeed:
.seed equ esp+4 ;,+8,+12,+16
     movdqu xmm0,[.seed]
     movntdq dqword[RandomSeed],xmm0
     ret 16

MakeSeed:
     rdtsc
     mov edx,eax
     call [GetTickCount]
     mov ecx,eax
     mul edx
     mov [RandomSeed],eax
     xor edx,ecx
     mov [RandomSeed+4],edx
     bswap ecx
     xor eax,ecx
     mov [RandomSeed+8],eax
     not edx
     bswap edx
     mul edx
     mov [RandomSeed+12],eax
     ret 0



section '.data' data readable writeable
fmt db 'Xor: %lu  Mov: %lu',0
buffer rb 0ffh

align 16
RandomSeed dd 1318699, 1015727, 1235239, 412943

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL'
      include  "%fasminc%\apia\kernel32.inc"
      include  "%fasminc%\apia\user32.inc"


section '.reloc' fixups data discardable  
    
tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias 31 May 2005, 20:30
First, I want to congratulate you on a job well done. I think it was EXCELLENT work, and I am impressed that you followed up on this topic. GOOD.
Umm, there are a couple of small points to address.
1. Let's assume, for sake of argument, that your benchmark study was perfect, and it is really quite good, so even if it had some small flaw, one can usefully draw some conclusions based on these results. My subsequent comments will focus on suggested "improvements", though, one person's improvements are another person's disposables!!! So, looking at your millisecond data,
Xor: 54894 - 55175
Mov: 55835 - 56203
we can see, very clearly, that xor completes faster than mov, at least for this sequence of instructions. So this certainly VINDICATES your position. In my opinion, you have succeeded in demonstrating, at least to my satisfaction, that I am wrong and you are correct: xor is faster than mov.
What one might wish to ask about this data, though, is this: the delta, roughly 700-1300 milliseconds over 55-56k milliseconds, represents an increase in speed of about 2.4% in the best case. In other words, is the obscurity associated with use of a boolean operator to clear a register worth such a modest improvement in execution speed, particularly realizing that the average program of 10,000 lines of code will have register clearing only about 10% of the time, i.e. 1000 total instructions? The time savings, while measurable, will not be noticeable by a human: 1000 x .024 x 2 clock cycles (maximum), or about 48 clock cycles, i.e. well under a microsecond slower with mov than with xor.
2. But was this an ideal test? I think a better test is one which creates a buffer of 10 million random 32 bit values (some number much larger than the 1 megabyte cache on a modern cpu). After creating this large buffer, one has three groups of instructions to time. In the first (no clearing), assign the first random value to the register, then manipulate that value in some way, say, by adding one to it, and repeat until finished. In the second group of instructions, the register is also cleared by means of xor, and in the THIRD, with mov. NO LOOPS in any of these three groups. ALL INLINE CODE (need a big editor, and a lot of cutting and pasting!!!) Timing is done as follows:
FIRST: load the first value into the register, perform the increment, BUT NO SUBSEQUENT CLEARING OF REGISTER, go to the next location for the second random value, and so on, repeated (but without loops), performed a million times (i.e. a million different 32 bit random numbers,) measure the time (I doubt it will be in milliseconds, just as well, for it is difficult to measure times of brief duration accurately.)
SECOND: load the first value, increment just as before, but then, CLEAR the register using xor, otherwise, exactly the same as first step, measure the time.
THIRD: load the first value, and increment just as before, but now, CLEAR the register using mov, otherwise, exactly the same as first step, measure the time.
I note that you were working in a Windows environment. This is not an appropriate testing environment for measuring execution times of these two instructions. You need a cpu with MINIMAL overhead: no interrupts, no polling, no i/o processes. Otherwise, the random background is much too noisy. Best is NO OPERATING system. Apart from these minor, and trivial, comments, good job. :)
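
The three timed groups (no clearing, clearing with xor, clearing with mov) boil down to subtracting a baseline. The sketch below only illustrates that arithmetic; the function names are hypothetical, and the timings have to come from real measurements in whatever unit (ticks, cycles) the test harness provides.

```python
def per_clear_cost(t_baseline, t_with_clear, n_clears):
    """Average extra time per clearing instruction, in the unit of the inputs.

    t_baseline   -- time for the load/increment sequence with no clearing
    t_with_clear -- time for the same sequence plus one clear per value
    n_clears     -- number of clearing instructions executed
    """
    if n_clears <= 0:
        raise ValueError("n_clears must be positive")
    return (t_with_clear - t_baseline) / n_clears


def compare(t_baseline, t_xor, t_mov, n_clears):
    """Return (xor_cost, mov_cost, mov_minus_xor), each per clear."""
    xor_cost = per_clear_cost(t_baseline, t_xor, n_clears)
    mov_cost = per_clear_cost(t_baseline, t_mov, n_clears)
    return xor_cost, mov_cost, mov_cost - xor_cost
```

Subtracting the FIRST group's time removes the cost of the loads and increments, so whatever is left can be attributed to the clearing instruction itself, provided the background noise is small compared with that difference.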
UCM



Joined: 25 Feb 2005
Posts: 285
Location: Canada
UCM 31 May 2005, 22:21
This is what you could do for a better test:
a. Use fasm's repeat...
Code:
[body.asm]
repeat [some_large_amount]
 ...
end repeat
    

then, make a beginning of a com file...
Code:
[head.asm]
format binary
org 256
.... ....
cli
    

then, make a footer...
Code:
[ft.asm]
sti
...
    

and then, in WinNT's cmd.exe...
Code:
fasm head.asm head.bin
fasm body.asm body.bin
fasm ft.asm ft.bin
type head.bin >> prog.com
type body.bin >> prog.com
type body.bin >> prog.com
type body.bin >> prog.com
type body.bin >> prog.com
type body.bin >> prog.com
......
type ft.bin >> prog.com
    

then you have a nice long com file.

b. Copy your prog.com onto a floppy, then put DOS on it (making it bootable). Then, boot the floppy, and in real mode, execute prog.com.
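
The gluing step in (a) can also be scripted. The sketch below is one possible way to do it, not a tested recipe: the function names are made up, and the size ceiling is an assumption (a .com image has to fit in a single 64 KB segment together with the 256-byte PSP), so "nice long" has an upper bound.

```python
COM_MAX_BYTES = 0xFF00  # assumed ceiling: a 64 KB segment minus the 256-byte PSP


def build_com(head: bytes, body: bytes, footer: bytes, copies: int) -> bytes:
    """Return head + body repeated `copies` times + footer as one .com image."""
    if copies < 0:
        raise ValueError("copies must not be negative")
    image = head + body * copies + footer
    if len(image) > COM_MAX_BYTES:
        raise ValueError(f"{len(image)} bytes will not fit in a .com file")
    return image


def max_copies(head: bytes, body: bytes, footer: bytes) -> int:
    """Largest number of body copies that still fits under COM_MAX_BYTES."""
    if not body:
        raise ValueError("body must not be empty")
    return max(0, (COM_MAX_BYTES - len(head) - len(footer)) // len(body))


def write_prog_com(copies: int, out_path: str = "prog.com") -> None:
    """Read head.bin, body.bin and ft.bin from the current directory and write out_path."""
    parts = []
    for name in ("head.bin", "body.bin", "ft.bin"):
        with open(name, "rb") as f:
            parts.append(f.read())
    with open(out_path, "wb") as f:
        f.write(build_com(parts[0], parts[1], parts[2], copies))
```

Calling `write_prog_com(5)` after the three fasm runs would do the same job as the chain of `type` commands, and `max_copies` tells you how many body copies fit before the image gets too large.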

PS. I don't think it matters whether you use mov or xor! The marginal speed improvement doesn't actually make any difference (unless you're clearing registers in a loop something like a trillion trillion times). It's only more satisfying, and also smaller.
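
On "smaller": the sketch below lists one valid 32-bit encoding for each way of zeroing EAX that comes up in this thread and compares their lengths. The bytes are the usual encodings, but treat the table as illustrative; an assembler is free to pick an equivalent alternative form, and the names `ENCODINGS` and `bytes_saved` are made up.

```python
# One common 32-bit encoding per instruction (an assembler may choose another).
ENCODINGS = {
    "xor eax,eax": bytes.fromhex("31c0"),        # 31 /r, ModRM C0
    "sub eax,eax": bytes.fromhex("29c0"),        # 29 /r, ModRM C0
    "and eax,0":   bytes.fromhex("83e000"),      # 83 /4 ib, sign-extended imm8
    "mov eax,0":   bytes.fromhex("b800000000"),  # B8+rd id, full 32-bit immediate
}


def bytes_saved(shorter: str, longer: str) -> int:
    """How many bytes `shorter` saves compared with `longer`."""
    return len(ENCODINGS[longer]) - len(ENCODINGS[shorter])


if __name__ == "__main__":
    for text, code in sorted(ENCODINGS.items(), key=lambda kv: len(kv[1])):
        print(f"{text:12} {len(code)} bytes  {code.hex(' ')}")
```

With these encodings, the xor form saves three bytes per clear compared with mov.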

_________________
This calls for... Ultra CRUNCHY Man!
Ta da!! *crunch*
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 01 Jun 2005, 01:28
Holy smokes, ladies and gents. I didn't want to open a can of worms here. I am currently on holiday and won't return until June 5th. I have some comments, but alas I'm having too much fun to add to the current force of this thread. I will when I return... as it seems it won't die before then ;)

BTW, thanks everyone for your input; I am learning quite a lot from your posts.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 02 Jun 2005, 08:56
Code:
times_4294967295:
XOR EAX,EAX ;111124clk 100.0% xor's got a very tricky logic
SUB EAX,EAX ;109257clk  98.3% carries make a long dependency
MOV EAX,0   ;105052clk  94.5% too much memory overhead
AND EAX,0   ; 85680clk  77.1% and's got the sweetest logic
    

I don't believe this test either, because my tests in a non-threaded environment show that xors issue 50% faster than movs, whether in a cacheable or a cacheless state, because of memory-read bottlenecks. If you provide it with cache hints, you can get it as fast as 250% in the best-case* scenario.

*This means that there are no bottlenecks in the µops, so that xor handles a different register in any 3 consecutive instructions. mov, on the other hand, does not seem to give a rat's a$$ about written registers, so no tricks are needed there :P
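
The percentage column in the listing is just each clock count relative to the XOR row. A small sketch that recomputes it from the posted counts (the helper name `relative_percent` is made up):

```python
# Clock counts as posted in the listing.
CLOCKS = {
    "XOR EAX,EAX": 111124,
    "SUB EAX,EAX": 109257,
    "MOV EAX,0": 105052,
    "AND EAX,0": 85680,
}


def relative_percent(clocks, reference):
    """Express every clock count as a percentage of the reference entry."""
    ref = clocks[reference]
    return {name: round(100.0 * clk / ref, 1) for name, clk in clocks.items()}


if __name__ == "__main__":
    for name, pct in relative_percent(CLOCKS, "XOR EAX,EAX").items():
        print(f"{name:12} {pct:5.1f}%")
```

Note that in this particular run XOR is the slowest of the four, which is the result being doubted in the text.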

Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.