flat assembler
Message board for the users of flat assembler.

Why is this code _so_ much slower?

Patrick_
This is really weird. Can someone tell me why this code:

Code:
        push edi esi

        mov edi, eax
        mov esi, ebx
        cld      ;increment string registers

 .comparison:
        cmpsb   ;compare byte at edi with byte at esi
        jne .invalid
        loop .comparison        ;decrement ecx, loop until ecx == 0
 .valid:
        xor eax, eax
        jmp .return
 .invalid:
        mov eax, 1
 .return:
        pop esi edi
        ret    


... takes 3.692s to do 100000000 iterations with my "driver", and this code:

Code:
        push edi
        push edx
        xor edi, edi    ;position in string
        xor edx, edx

 .loop:
        cmp edi, ecx    ;reached max bytes to compare?
        je .equal       

        cmp byte [ebx+edi], 0   ;have we reached end of string2?
        je .found_end_s2
        mov dl, [eax+edi]       
        cmp dl, 0               
        je .found_end_s1
        cmp dl, byte [ebx+edi]  
        jne .not_equal          
        inc edi                 
        jmp .loop

 .found_end_s2:
        cmp byte [eax+edi], 0        ;is s1 also 0?
        jne .not_equal          ;no, so not equal
        jmp .equal

 .found_end_s1:
        cmp byte [ebx+edi], 0        ;is s2 also 0?
        je .equal

 .not_equal:
        pop edx
        pop edi
        mov eax, 1
        ret
 .equal:
        pop edx
        pop edi
        xor eax, eax
        ret    


... takes only 1.591s to complete 100000000 iterations on the same two strings? I'm quite surprised that my original code (the second example) is so much faster. I just rewrote my memset function to use "rep stosb" to fill memory with a given byte, and that gave me a 70% speedup. So why does this code give me such a slowdown? The memset speedup didn't surprise me, because I replaced my original byte-by-byte code with specialized string instructions, but I'm puzzled as to why the first example is so much slower than the second.

Post 26 Apr 2006, 18:49

r22
String opcodes are slow. You can almost always write faster code without using specialized opcodes like stos, cmps*, and rep.
In your second routine, replace "cmp dl, 0 / je .found_end_s1" with "test dl, dl / jz .found_end_s1" and you may get a little speed boost.
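
As a rough illustration of that substitution in the second routine's loop (the byte encodings in the comments are for 32-bit code; je and jz are the same instruction):

Code:
        mov dl, [eax+edi]
        test dl, dl             ;84 D2: sets ZF exactly when dl == 0
        jz .found_end_s1        ;was: cmp dl, 0 (80 FA 00) / je .found_end_s1

The test form is one byte shorter; whether it is measurably faster depends on the CPU.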

Code:
; in:  al  = byte to set (eax = 000000XXh, upper 24 bits must be zero)
;      ebx = length (how many bytes to set)
;      esi = address to start at
; out: eax = 0; ebx and edx are clobbered
memset:
        mov edx, 01010101h
        mul edx                 ;eax = XXXXXXXXh (edx becomes 0)
 .LPqword:
        sub ebx, 8
        js .skip                ;fewer than 8 bytes left
        mov dword [esi+ebx+4], eax
        mov dword [esi+ebx], eax
        jmp .LPqword
 .skip:
        add ebx, 8              ;ebx = remaining bytes (0..7)
 .finishup:
        dec ebx
        js .end
        mov byte [esi+ebx], al
        jmp .finishup
 .end:
        xor eax, eax
        ret

Post 26 Apr 2006, 19:30

Patrick_
Thanks, that's nice to know.
Post 26 Apr 2006, 20:37

LocoDelAssembly
Patrick_, did you check what happens if you replace loop .comparison with dec ecx/jnz .comparison? If I remember correctly, loop is worse than the equivalent instruction pair.
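
A minimal sketch of that substitution, assuming the same setup as the first routine (esi and edi already loaded, cld done, ecx = nonzero byte count):

Code:
 .comparison:
        cmpsb                   ;compare byte at [esi] with byte at [edi], advance both
        jne .invalid
        dec ecx                 ;one byte fewer to compare
        jnz .comparison         ;keep going until ecx reaches 0

One difference to keep in mind: loop leaves the flags untouched, while dec modifies ZF. That is harmless here because jne is tested before the decrement.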
Post 26 Apr 2006, 20:45

Patrick_
LocoDelAssembly wrote:
Patrick_, did you check what happens if you replace loop .comparison with dec ecx/jnz .comparison? If I remember correctly, loop is worse than the equivalent instruction pair.

Actually, yes, I did, since I also remember loop being very slow; however, it made no difference whatsoever.
Post 27 Apr 2006, 00:27

Borsuc
AMD says you should use loop, so it's better anyway on AMD.
Post 01 May 2006, 11:00

f0dder
The_Grey_Beast wrote:
AMD says you should use loop, so it's better anyway on AMD.


It's not worth it considering the massive speed hit you'll take on Intel architecture, though.

As for the string instructions, "rep movsd" and "rep stosd" are relatively fast, while cmps* is slow. Even movsd/stosd can be beaten, though, especially if you use non-cached (non-temporal) writes. You have to ask yourself what minimum CPU you want to require.
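
For comparison with the hand-written loop earlier in the thread, here is a minimal sketch of a memset built on "rep stosd". The name memset_rep is made up for illustration, the register convention is borrowed from r22's routine, and a flat memory model (ES = DS) is assumed. No timing claim is implied; measure on the CPUs you care about.

Code:
; in:  al  = byte to set
;      ebx = length in bytes
;      esi = address to start at
memset_rep:
        push edi
        push ecx
        movzx eax, al
        imul eax, eax, 01010101h ;replicate the byte into all four bytes of eax
        mov edi, esi            ;stos writes to [edi]
        cld                     ;string registers increment
        mov ecx, ebx
        shr ecx, 2              ;number of whole dwords
        rep stosd
        mov ecx, ebx
        and ecx, 3              ;0..3 leftover bytes
        rep stosb
        pop ecx
        pop edi
        ret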

Post 01 May 2006, 11:12

Borsuc
Yep, "rep movsd" is especially useful when doing things inline, that is, without calling functions.

If I recall correctly, loop is the same as "dec ecx/jnz" on Pentium, but it's faster on AMD. Why not use it? (OK, correct me if I'm wrong.) It's also smaller, if you strive for size.
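
On the size point, a sketch of the two forms with their 32-bit encodings, assuming the target label is within short-jump range:

Code:
        loop .comparison        ;E2 rel8 = 2 bytes

        dec ecx                 ;49      = 1 byte
        jnz .comparison         ;75 rel8 = 2 bytes

So loop saves one byte per use. Note that the equivalent pair ends in jnz, and that dec changes the flags while loop does not.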
Post 01 May 2006, 13:12

Copyright © 1999-2020, Tomasz Grysztar.