flat assembler
Message board for the users of flat assembler.

Index > Main > Optimizing Education...Please?

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 24 May 2005, 14:35
Hi,

There are several of you who have made a huge impression on me by your coding. I am so elementary in comparison. So in order to help learn better style and optimization techniques I am posting a routine that converts ASCII Hex Characters to a number in hopes that I can do better by allowing you all to riducule my elementary style of coding. I used a top down approach and perhaps as well as very basic code. The thought process here or algorythm is to check the string length (max 8 for the 32bits used...could perhaps be more, if an when I learn 64-bit registers et al), multiply each nibble from left to right by the corresponding exponent. In other words, if the string is 8 characters long, the first character would be multiplied by 16^7 (i.e. 76543210 is string, then 7 x 16^7). I'm certain that this can be done much faster and with better ASM style. Please give me some insight into your thought process and how you would approach this:

Code:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; ASCIIHexToNumber - Converts ASCII Hex values to a number
;;
;; Input: ESI; points to string containing the ASCII Hex Text
;;
;; Output: EAX; a 32-bit number
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Nibble0                 dd 1                    ; 16^0
Nibble1                 dd 16                   ; 16^1
Nibble2                 dd 256                  ; 16^2
Nibble3                 dd 4096                 ; 16^3
Nibble4                 dd 65536                ; 16^4
Nibble5                 dd 1048576              ; 16^5
Nibble6                 dd 16777216             ; 16^6
Nibble7                 dd 268435456            ; 16^7

ReturnNumber            dd 0                    ; Variable to save and add up each character and place into EAX upon return
ConvertedCharacter      db 0                    ; Temporary holder for converted character to a number (multiplier)

ASCIIHexToNumber:

        push edi
        push ebx
        push ecx
        push edx

        mov edi,esi                             ; StringLength uses EDI, save ESI to EDI
        call StringLength                       ; Returns the string length in ECX
        mov edx,0                               ; Zero out EDX
        mov [ReturnNumber],dword 0              ; Zero out ReturnNumber

.LoopToAdd:

        mov bl,[esi + edx]                      ; Copy first character to BL
        cmp bl,57                               ; Compare BL versus '9'
        jg .ALetter
        sub bl,48

.DoConversion:

        mov [ConvertedCharacter],bl             ; Save converted character
        movzx ebx,byte [ConvertedCharacter]     ; Zero fill EBX with converted character 
        mov eax,[Nibble0 + (ecx - 1) * 4]       ; Place current nibble multiplier into EAX
        push edx                                ; Save EDX counter
        mul ebx                                 ; Multiply converted character by current nibble saves to EDX:EAX (won't need EDX)
        pop edx                                 ; Restore EDX counter
        add [ReturnNumber],eax                  ; Save current converted character to number by adding it to ReturnNumber
        inc edx                                 ; Increase EDX counter
        dec ecx                                 ; Decrement ECX counter
        cmp ecx,0                               ; Compare for zero
        jg .LoopToAdd
        jmp .Done

.ALetter:

        cmp bl,70                               ; Compare BL to upper case 'F'
        jg .LowerCase                           ; If greater then it is lower case
        sub bl,(65 - 10)                        ; Subtract 'A' and add 10 (BL - [65 - 10] = Multiplier)
        jmp .DoConversion                       ; Go convert the character to a number

.LowerCase:

        sub bl,(97 - 10)                        ; Subtract 'a' and add 10 (BL - [97 - 10] = Multiplier)
        jmp .DoConversion                       ; Go convert the character to a number

.Done:

        mov eax,[ReturnNumber]                  ; Save ReturnNumber into EAX for return

        pop edx
        pop ecx
        pop ebx
        pop edi

        ret
    


P.S. Be gentle in your ridulule of my code... Wink
Post 24 May 2005, 14:35
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 25 May 2005, 06:48
This was the best I could come up with.
If a couple assembler veterans post their code on this thread we can break out the benchmarking :D.

Code:
HexStr2Bin:
.strPtr equ esp+4
     push ebp
     push ebx
     mov ebp,[.strPtr+8] ;get str ptr
     lea edx,[ebp-1]       ;copy & DEC str ptr
  .len:
     add edx,1
     cmp byte[edx],0
     jnz .len                  ;edx = last byte in str NULL
     xor ecx,ecx
     xor eax,eax
  .conv:
     dec edx                 ;loop through the string backwards
     cmp edx,ebp          ;done when edx is less then ebp
     jl .ret
     movzx ebx,byte[edx]  ;get the byte
     sub bl,30h ;65-48=17=7 97-48=49-7=42=32
     cmp bl,17             ;not exact number to use but works
     jl .continue
     sub bl,7
     cmp bl,17             ;not exact number to use but works
     jl .continue
     sub bl,32
  .continue:          ;the byte should be between 0-15
     shl ebx,cl        ;shift left AKA multiply by power of 16
     add eax,ebx    ;make the return value
     add cl,4           ;jump up to a new power of 16 for shift left (multiply)
     jmp .conv
  .ret:
     pop ebx
     pop ebp
     ret 4  
    
Post 25 May 2005, 06:48
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 25 May 2005, 08:25
Code:
;<<<-c810----CODE TO CONVERT STRING TO BINARY--------------->>>
toBinary:                      ;ebx=base,esi=source,edi=dest
    xor  ecx,ecx               ;clear values so we can add up
    xor  eax,eax               ;to them later
  .nxtnmbr:
    mul  ebx                   ;multiply with base so we can
    mov  cl,[esi]              ;get another number to add to
    sub  cl,30h                ;lower part of it.
    cmp  cl,10
    jc   .nmbr                 ;0..9?
    sub  cl,7                  ;make A..Z=>10..35
  .nmbr:
    cmp  ecx,ebx               ;is the input bigger than base-1
    jc   .inlimit
    cmp  [chng],-1             ;Did we come here in previous
    jz   .inlimit              ;loop?
    mov  [chng],-1             ;No disturbance in the next loop
    div  ebx                   ;pre-divide by base (needed!)
    push eax                   ;push eax/base
    mov  eax,ebx               ;base in eax for multiply
  .test:
    mul  ebx                   ;eax*=base
    push eax
    mov  eax,edx
    mul  ebx
    mov  edx,eax
    pop  eax
    cmp  ecx,eax               ;is the input still bigger?
    jnc  .test
    mul  dword[esp]            ;it wasn't so return our eax
    add  esp,4                 ;instead of popping we emulate
  .inlimit:                    ;because pop reads memory
    mov  [chng],0              ;continue when input is in the
    add  eax,ecx               ;limits of 0..base-1
    inc  esi                   ;point to next char
    cmp  [esi],byte 0          ;0 is the end marker
    jnz  .nxtnmbr              ;loop if non-zero character
    mov  [edi],eax             ;result to destination pointer
    mov  [edi+4],edx
ret
;<<<-c81F----END OF CODE------------------------------------>>>
    

Code:
;<<<-c820----CODE TO CONVERT BINARY TO STRING--------------->>>
toString:                      ;eax=base,BINdata=source,
    cmp  eax,2                 ;STRdata=destination
    jnc  .cont                 ;base_2 - the lowest that exists
    mov  [STRdata],"<=LO"      ;Easier than defining data
    mov  [STRdata+4],"W!"      ;yet perfectly legal
    ret
  .cont:
    cmp  eax,37                ;base_36 defined as maximum in
    jc   .cont2                ;this program
    mov  [STRdata],"<=HI"
    mov  [STRdata+4],"GH!"
    ret
  .cont2:
    push ebp                   ;We passed the checks
    xor  ebx,ebx               ;clear some registers
    xor  esi,esi               ;...
    xor  edx,edx               ;.
    mov  ebp,STRdata           ;define important pointers
    mov  edi,eax               ;can't be in eax, because it is
    mov  eax,[BINdata]         ;used for calculations mul/div
  .again:
    div  edi                   ;dividing by base gives us
    lea  ecx,[edx+30h]         ;numbers in reverse order
    cmp  ecx,3Ah               ;They must be translated to
    jc   .push                 ;readable characters and pushed
    add  ecx,07h               ;to stack so we can easily
  .push:                       ;reverse them
    mov  ch,cl                 ;byte-by-byte: every word push
    push cx                    ;erases cl so ch preserves it
    inc  esp                   ;this allows to virtually push
    inc  ebx                   ;bytes (instead of words)
    xor  edx,edx               ;clear remainder of division
    cmp  eax,0                 ;eax==0 ? nothing to divide
    jnz  .again
  .test:
    sub  ebx,4                 ;popping is hard
    jc   .ovr                  ;n%4==0 ? lets pop 4 bytes
    pop  [STRdata+eax]         ;dword pop
    add  eax,4                 ;correct eax
    jmp  .test                 ;do it again
  .ovr:
    pop  [STRdata+eax]         ;pop dword
    sub  esp,4                 ;correct stack
    cmp  ebx,-3                ;check how many we left there
    jc   .exit                 ;none? ok, we can finish
    jz   .three                ;jump accordingly
    cmp  ebx,-2                ;now checking to get additional
    jz   .two                  ;flags
    inc  esp                   ;only 1
    inc  eax
  .two:                        ;2..
    inc  esp
    inc  eax
  .three:                      ;3..
    inc  esp                   ;(inc esp)x3 but we increased
    inc  eax                   ;one too many in the beginning
  .exit:                       ;so it's a whole dword
    mov  [STRdata+eax], 0      ;string must end with 00h
    pop  ebp
ret
;<<<-c82F----END OF CODE------------------------------------>>>
    


These are N conversions - 2..36 ASCII<=>BIN

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 25 May 2005, 08:25
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 25 May 2005, 09:28
r22 & Madis731,

You've both given me a few things to think about for sure. You both use operations I normally don't so this will give me the opportunity to preview them in a working state, thanks! Both of you shortened variable names. Is this considered normal style? The reason I ask is it seems like, in FASMs case, you can take the luxury of using names that are far more descriptive. I suspect this style comes from coding from way-back when variable lengths where limited?

I like the idea of checking the speed too. I only know a few things about speed. From a lot of your posts, it seems you have ideas, I think, on hyperthreading (or multiprocessor) speed increases. I say this based on the dependancies between operations and the need for one to complete before another operations. I suspect for parallel processing.

Is there a quick reference on clock cycles for processor types on the net? I know for example XOR EAX,EAX has less overhead in drive space, but is it a faster operator than MOV EAX,0?

Enough questions for now...I have to run to work. Thanks again for the code examples. I'll study them more closely this evening.
Post 25 May 2005, 09:28
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 25 May 2005, 19:59
No short names are bad and you should use as long ones as neccessary, but NOT too long that may confuse people.
Why I'm used to this bad habit? I really don't know Very Happy, but for my defence - .toString can only be translated in one way you can convert binary to string conserning numbers Smile.

You must thank Agner Fog for the next one:
http://www.agner.org/assem/
The file in the attachment is result of his work.

Also you should check out 9. "64bit BCD add" from http://www.azillionmonkeys.com/qed/asmexample.html
because if I ever wanted to optimize - I would loose the compares to check for characters in a HEX ASCII string. "Push them over the edge" so-to-speak.

This topic is too tricky to tell you to go search the board or manuals so here it goes:
-------------------------
XOR was to be thanked in the 16-bit ages where MOV reg,const would take 5 bytes so taking 3 CPU reads from the memory.
Today (32-bit) it is not so important (though still faster and smaller!), because 32 bits (read 4 bytes) can be read at once so only two reads are neccessary. In case of an XOR reg,reg you can clear 2 registers in one memory-read if you're lucky(meaning 4 byte aligned code).

Example is appropriate:
Code:
align 4
mov eax,0 ;B8 00 00 00 00
mov edx,0 ;BA 00 00 00 00
;Result=10 bytes -> 3 to 4 memory-reads

mov eax,0   ;B8 00 00 00 00
mov edx,eax ;8B D0 (STALL!)
;Result=6 bytes -> 2 to 3 memory-reads

xor eax,eax ;33 C0
xor edx,edx ;33 D2
;Result=4 bytes -> 1 to 2 memory-reads
    

I would use XOR for speed+size
I would use MOV for readability (maybe speed?)

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 25 May 2005, 19:59
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 25 May 2005, 20:08
Sorry for double posting, but you should know another good trick when you want to avoid bad dependency checking with XOR reg,reg:
Code:
and eax,0 ;83 E0 00 <= notice the optimized instruction
and edx,0 ;83 E2 00 <= 1-byte 00 clears the whole 32-bits
;Result=6 bytes -> 2 to 3 memory-reads
;There is no stall and should heat less CPU because of the
;simpler circuitry of AND compared to XOR Very Happy
    


Did some preliminary testings: you first code was not so bad ~325±5 clocks. Your use of time-taking instructions for shorter code didn't pay off it makes unexpected stalls in parts of your code.

I hoped it would be much more improved but it only improved 50% by the second try on that code - r22: 167±0 clocks. I like the stability though BIG "+".


Last edited by Madis731 on 25 May 2005, 20:27; edited 1 time in total
Post 25 May 2005, 20:08
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
FlashBurn



Joined: 06 Jan 2005
Posts: 87
FlashBurn 25 May 2005, 20:15
The best algo I know is this:

Code:
xor ebx,ebx
xor eax,eax
.loop:
lodsb
test al,al
jz .end
make a binary number out of the ascii byte
shl ebx,4
or ebx,eax
jmp .loop
.end:
mov eax,ebx
ret
    
Post 25 May 2005, 20:15
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 26 May 2005, 06:02
The next code is not wrapped inside CALL|RET so its not fair to test it side-by-side with other snippets, but still it only takes 111 clocks. I think that memory reads/writes and call-ret overhead won't be much. Oh, and the code itself ofcourse:
Code:
mov edi,"F3D4" ;Constant1
mov eax,edi
and eax,40404040h
shr eax,6
imul ebx,eax,7
;lea ebx,[eax*4+eax]
;lea ebx,[eax*2+ebx]
sub edi,ebx
mov eax,edi
and eax,000F000Fh
and edi,0F000F00h
ror eax,12
rol edi,8
or  eax,edi
mov edi,eax
shr edi,8
or  eax,edi
mov edi,"CA9E" ;Constant2
mov edx,edi
and edx,40404040h
shr edx,6
imul ebx,edx,7
;lea ebx,[esi*4+esi]
;lea ebx,[esi*2+ebx]
sub edi,ebx
mov ebp,edi
and ebp,000F000Fh
and edi,0F000F00h
ror ebp,12-8
rol edi,8+8
or  ebp,edi
mov edi,ebp
shl edi,8
or  ebp,edi
shl eax,16
shr ebp,16
or  eax,ebp
    

I haven'toptimized it neither for speed nor size, because it was late last night and I just wanted to get it raedy. Very Happy ...but I fell asleep anyway so...your comments|critisism, please!
Post 26 May 2005, 06:02
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 26 May 2005, 11:50
Madi731,

Thanks, I have placed http://www.azillionmonkeys.com/qed/tech.shtml in my book marks for sure. There are a few pages there I hadn't seen yet. I took a few glimpses at Paul Hsieh's pages last night. I will study them as well.

So you are aware, my background is hardware (EE), so the digital info makes a lot of sense. Particularlly the ANDing of a zero to a register. I suppose there are a lot more boolian tricks that can be played. That probably comes with more experience I'm certain.

This evening if I get the opportunity I'll take a closer look at your last code fragment. It is very tight indeed. Thanks again!

Also, when I posted my procedure, I hadn't had the opportunity to test it myself until last night. I suppose, luckily, it worked the first try. Wink From the standpoint of readability, my peice does that quite well I think. Although a 2:1 increase as you say isn't huge for one pass, but say with a PCI table conversion for ASCII Hex to number, with nearly 5,000 devices, 330 clocks/device x 5,000 devices/list = 1,650,000 clocks/list versus 825,000 clocks/list. On an old machine, that would be maddening to wait on, especially since their clocks/operation were longer too. This is good stuff...can you see my excitement? Very Happy

FlashBurn,

Thanks for the snippet. I'll take a closer look at it this evening as well.
Post 26 May 2005, 11:50
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 26 May 2005, 16:45
Its time for MMX:
I found this algorithm to be very interesting, because there is no such instruction that would pack BYTES=>NIBBLES. I would write a 10clock program then Razz
Code:
movq     mm0,[ASCIIhex]
movq     mm1,[Compare1]
movq     mm2,[Compare2]
movq     mm3,[Compare3]
movq     mm4,[Compare4]
movq     mm5,mm4
psllw    mm5,8
pcmpgtb  mm1,mm0
psubb    mm0,mm2
pandn    mm1,mm3
psubb    mm0,mm1
movq     mm1,mm0
pand     mm0,mm4
pand     mm1,mm5
psrlw    mm1,4
por      mm0,mm1
packuswb mm0,mm0
movd     eax,mm0

ASCIIhex dq "F3D4CA9E"
Compare1 dq 4040404040404040h
Compare2 dq 3030303030303030h
Compare3 dq 0707070707070707h
Compare4 dq 00FF00FF00FF00FFh
    
Post 26 May 2005, 16:45
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 26 May 2005, 20:52
WOW!

I haven't opened the can of 64-bits yet. I am making a few assumptions that most common 32-bit/16-bit operations are available with a prefix or suffix? Are there specifics for each register too, like MUL and DIV which use multiple registers (this leads into FPU operations too, to which I have a little experience with). Please forgive my elementary questions. I intend on getting a good book (if you know of one, please let me know) on the different instructions sets. The one book (which is fairly worn) Using Assembly Language 2nd Edition by Allen L . Wyatt (C) 1990. I've had it that long...but it doesn't have the MMX, SSE, etcetera. Although it does have quite a bit (pun intended) on FPU operators. Coding examples for most of the 32-bit operations, for example SETNAE, is just a simple:

Code:
   SETNAE CL
    


Without the ability to see specific flow it becomes tough to follow, unless, and I should probably do this, I use the instruction(s) one at a time to get to know them better.

Again <cough> thanks for the information. I have a lot to go over just from this thread alone.
Post 26 May 2005, 20:52
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 27 May 2005, 04:00
I was disappointed by the MMX code :[
1 ) It reads the strings BACKWARDS
string db 'F','F','F','F',0 ;non mmx versions = 65535
string1 db 'F','F','F','F','0','0','0','0',0 ;mmx version = 65535
2 ) Also it only handles UPPERCASE letters.

Fixing the above two problems makes the MMX code slower than the following.

Using the optimizations from others on this thread I tweaked my original code. I'm pretty sure this code can be easily ported to 64bit simply by replacing the E's with R's.

Code:
align 8
HexStr2Bin:
.strPtr equ esp+4
     mov ecx,[.strPtr]
     xor eax,eax
  .conv:
     movzx edx,byte[ecx] 
;more speed IF following 4 lines can be DE-branchified
     sub dl,30h
     js .ret                      ;the byte is null return
     test dl,11110000b     ;check for A-F a-f
     jz .ctn                      ;not an upper/lower case char skip to ctn:
     and dl,00001111b     ;make A-F a-f  = 0-5
     add dl,9                   ;correct 0-5 to 10-15
  .ctn:                           
     shl eax,4                 ;shift eax to the next power of 16
     add ecx,1                ;increment the string ptr
     or al,dl                    ;OR in the new value
     jmp .conv                ;continue the loop
  .ret:
     ret 4
    


The unrolled version...
Code:
align 8
HexStr2Bin:
.strPtr equ esp+4
     mov ecx,[.strPtr]
     xor eax,eax
  .conv:
     mov edx,[ecx]
     sub dl,30h
     js .ret
     test dl,11110000b
     jz .ctn
     and dl,00001111b
     add dl,9
  .ctn:
     shl eax,4
     or al,dl
     shr edx,8
     sub dl,30h
     js .ret
     test dl,11110000b
     jz .ctn1
     and dl,00001111b
     add dl,9
  .ctn1:
     shl eax,4
     or al,dl
     shr edx,8
     sub dl,30h
     js .ret
     test dl,11110000b
     jz .ctn2
     and dl,00001111b
     add dl,9
  .ctn2:
     shl eax,4
     or al,dl
     shr edx,8
     sub dl,30h
     js .ret
     test dl,11110000b
     jz .ctn3
     and dl,00001111b
     add dl,9
  .ctn3:
     shl eax,4
     add ecx,4
     or al,dl
     jmp .conv
  .ret:
     ret 4
    


Well, I'm done kicking a dead horse.
Post 27 May 2005, 04:00
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 27 May 2005, 08:44
I don't know what is the matter with my CPU or my RAM, but the tighter one takes 162 clocks and the unrolled onw 201 clocks - does the unrolling mess up my CPUs branch prediction?
Post 27 May 2005, 08:44
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias 27 May 2005, 10:02
smiddy wrote:
I know for example XOR EAX,EAX has less overhead in drive space, but is it a faster operator than MOV EAX,0?



May I ask how you know this????
I am laboring under the possibly FALSE impression, that most of the information about timings of instructions is folklore based on 80386 days.....
I have writtent this before, and I will write it again. Even if
XOR EAX, EAX is
ONE THOUSAND times faster than
MOV EAX, Zero,
I will still employ the latter expression EVERY time.
So many myths pervade this field. One of those myths is the silly notion of speed. Faster is better. Sure. Faster is better, so is power saving.
In fact, as between saving electricity, and reducing execution time, I suspect that for MOST OF THE world's population, saving electricity is far more important than saving execution time. Unless one plans to turn off the computer sooner, saving electricity, as a result of this nanosecond faster computation, one should stick to MOV EAX, Zero.
Computers today are so fast, that saving a few clock cycles at the cost of IMPROPER programming style (substituting unreadable CODE for easily understood PROGRAM) is an UNFORGIVEABLE error.
But, apart from the argument that readability reigns supreme in evaluating alternate programs to accomplish the same goal, WHERE'S the timing data to support the notion that XOR EAX, EAX executes faster than MOV EAX, Zero???? In particular, where are the benchmarks published? Where's the url to a web site that explicitly defends this obsolete practice, employed religiously by legions of assembly language programmers, who no longer understand WHY Intel developed Boolean and bitwise operators in the first place! (to save precious memory!) Regards, tom Smile
Post 27 May 2005, 10:02
View user's profile Send private message Reply with quote
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 27 May 2005, 11:33
tom tobias wrote:

But, apart from the argument that readability reigns supreme in evaluating alternate programs to accomplish the same goal, WHERE'S the timing data to support the notion that XOR EAX, EAX executes faster than MOV EAX, Zero???? In particular, where are the benchmarks published? Where's the url to a web site that explicitly defends this obsolete practice, employed religiously by legions of assembly language programmers, who no longer understand WHY Intel developed Boolean and bitwise operators in the first place! (to save precious memory!) Regards, tom Smile


Hi tom,
I'll just copy&paste some info from the Famous Pentium Optimization Manual
Quote:

MOV r,i:

Uops: 1
Microcode: 0
Latency: 0.5
Additional latency: 0.5 - 1
Reciprocal throughput: 0.25
Port: 0/1
Execution unit: alu0/1

Quote:

AND, OR, XOR r,r:

Uops: 1 (same)
Microcode: 0 (same)
Latency: 0.5 (same)
Additional latency: 0.5 - 1 (same)
Reciprocal throughput: 0.5 (MOV: 0.25)
Port: 0 (MOV: 0/1)
Execution unit: alu0 (MOV: alu0/1)

According to these timings, MOV can under some circumstances a bit quicker, but XOR r32,r32 has a special hardware support:
Quote:

15.10 Breaking dependencies
A common way of setting a register to zero is XOR EAX,EAX or SUB EBX,EBX. The P4
processor recognizes that these instructions are independent of the prior value of the
register. So any instruction that uses the new value of the register will not have to wait for
the value prior to the XOR or SUB instruction to be ready. The same applies to the PXOR
instruction with a 64-bit or 128-bit register, but not to any of the following instructions: XOR or
SUB with an 8-bit or 16-bit register, SBB, PANDN, PSUB, XORPS, XORPD, SUBPS, SUBPD,
FSUB.
The instructions XOR, SUB and PXOR are useful for breaking an unnecessary dependence.
On PPro, P2 and P3, you have to write MOV EAX,0 to break the dependence.

So, what is quicker then? Smile Wink


Last edited by MazeGen on 27 May 2005, 12:41; edited 1 time in total
Post 27 May 2005, 11:33
View user's profile Send private message Visit poster's website Reply with quote
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 27 May 2005, 11:36
tom tobias wrote:
smiddy wrote:
I know for example XOR EAX,EAX has less overhead in drive space, but is it a faster operator than MOV EAX,0?



May I ask how you know this????


Hi Tom,

I wasn't trying to pick a fight here. I truely want to understand. I agree that readability is, especially after say six months of coding, you return to your original and you go, "WTF was that?" But since you asked, empirically I have boot sector code I wrote that reads in the FAT and root directory. Suffice it to say, I ran out of room initially for the 512 bytes. So I was looking for ways to reduce my on disk overhead. One way was to change all the MOV register, 0 to XOR register, register. A byte per each, around twelve of them. I subsequently was able to fit everything into one sector.

As for speed, well, that was my question, not a statement. Personally I prefer hard evidence too. I am an avid fan of lab work, it makes me happy... So, if we need benchmarks and such perhaps one day I'll get the opportunity to work on that and place them on the net for all to view. But alas I was merely asking the question to see what was available on information.

I'm getting the impression after re-reading your post that you may not have read the entire thread. I also get the impression that you really didn't read my question either... I'm sure I understand your perspective, I'm not certain you understand mine. Was this a knee-jerk reaction on your part? I would hate to continue posing questions and be slammed as being miss-guided, when I'm actually seeking guidance. What gives?
Post 27 May 2005, 11:36
View user's profile Send private message Reply with quote
smiddy



Joined: 31 Oct 2004
Posts: 557
smiddy 27 May 2005, 11:47
@MazeGen,

Thanks for posting this information and the link. I have a lot to learn...
Post 27 May 2005, 11:47
View user's profile Send private message Reply with quote
tom tobias



Joined: 09 Sep 2003
Posts: 1320
Location: usa
tom tobias 27 May 2005, 20:13
Thanks MazeGen and Smiddy, good replies, both.
No, I was not seeking to belittle or argue with anyone.
The reference to Pentium optimization is quite interesting. I think a proper benchmark study would be useful to sort out the answer to the question of timing for XOR versus MOV. However, I am quite certain that few of the vast majority of participants of this forum will commence employing MOV EAX,zero, instead of XOR EAX, EAX, even if a properly designed benchmark study demonstrated approximate equivalence in timing for both instructions. For the most part, people have lots of excuses for continuing to write CODE, instead of creating programs. One of the main reasons for reluctance to adopt a more readable approach to writing programs is given by smiddy, quite clearly: He needed to save space--> that was paramount for him. This notion of saving memory by writing "tight code" is found in many places on the FASM forum. Not long ago, for example, one encountered a competition based upon using the fewest instructions to implement some task. Note, no one was interested in a competition about writing the CLEAREST, most readily transparent, easiest to comprehend PROGRAM, no, what was important was saving space, i.e. memory, as if this were 1955 again. Hmm. I must be from another planet.... Seems to me, I have heard that argument, about the importance of saving memory, as if it were some kind of ADVANTAGE, for more than four decades now.....I have no interest in short, obscure instructions that whittle 39 milliseconds of time off the execution of some equally obtuse muddle of CODE, and reside in a space smaller than a thimble. I like LONG, DETAILED, INFORMATIVE PROGRAMS, written so that someone can actually read the text and understand what the programmer seeks to accomplish. I LIKE SQUANDERING memory, I am happy when I encounter VERBOSITY with actual descriptions of a procedure, or an algorithm or a data structure. Saving n bytes in order to FIT some piece of code into a narrow little space is not my idea of progress. For sure, if one is really keen on saving space, the x86 instruction set is unsuitable......regards, tom
Post 27 May 2005, 20:13
View user's profile Send private message Reply with quote
madmatt



Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
madmatt 28 May 2005, 20:32
Fortunately , you can use the structured macros provided with the Fasm package and write some very clean and structured code without bloating the code size up, and without compromising the blinding speed that assembly provides. Thank you Privalov for a job well done!! Very Happy
MadMatt
Post 28 May 2005, 20:32
View user's profile Send private message Reply with quote
beppe85



Joined: 23 Oct 2004
Posts: 181
beppe85 29 May 2005, 22:57
I think it's hard to do timings, as thousands of "xor r32, r32" or "mov r32, 0" are far from being real instruction sequences. In this cases you need to look for pairings, IMO a pain. Only starting with P4 "xor" may perform better(just guessing), as "mov" is a longer encoding but break dependencies in 686. Sure they will perform equally well if they fit in cache.

And "xor r32, r32" is clear for me... You got it? "clear" hehehehe
Post 29 May 2005, 22:57
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.