flat assembler
Message board for the users of flat assembler.

Index > Main > Clear top byte of 64bit GPR?

Goto page Previous  1, 2, 3  Next
Author
Thread Post new topic Reply to topic
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
Thanks, but isn't
cmp reg,[const]

faster than

mov ecx,i
mov edi,memory
mov esi,const
rep cmpsb

??
Post 28 Aug 2009, 15:45
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
If you have many strings, it probably may not be because of caching and micro ops optimization (the CPU does that in loops), though I'm not sure. At least one argument in favor of the data structure (except for being easier to add a string): it's smaller, and that's measurable easily Cool

Anyway just thought I'd give you an alternative Smile

(if you have 2-3 strings then it's not worth it to do it the data structure-way obviously, but if you plan to have many, then it's the way to go IMO).

_________________
Previously known as The_Grey_Beast
Post 28 Aug 2009, 15:50
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
The rep won't work to do all the comparisons in it.. I would need one of them for each comparison.. wouldn't I? Confused
Post 28 Aug 2009, 15:55
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
I don't get your question. You mean you won't compare all of the strings with rep?

Of course not, if that's what you're saying. What you're doing is compare the string with each string from the data structure.

It's not like the other case doesn't do the same. (i.e you have multiple 'cmp' not just one!). Razz

so basically, if you don't consider the "rep cmpsb/cmpsd/whatever" a loop, then you have only one loop (through all the strings in the database).
Post 28 Aug 2009, 20:37
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
I just don't see how a rep cmpsb could ever be faster than a cmp Confused

I guess the only way to find out is to benchmark it though.


P.S. how do I use your irps thing? I don't understand. It looks like I'd have to place it down where the strings would be, and manually enter each string I want to use into it.. in which case I may as well write the db/dq/whatever myself..
Post 29 Aug 2009, 05:11
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Pirata Derek



Joined: 31 Oct 2008
Posts: 259
Location: Italy
Pirata Derek
The rep cmpsb permits you to execute a multiple serial compasion.
If you have to compare a huge amount of data, like 2 buffers comparsion, you should use these mnemonics.
Code:
; Compare buffer function

; INPUT:
;     PUSHD first buffer (offset)
;     PUSHD second buffer (offset)
;     PUSHD bytes to compare (dword)

; OUTPUT:
;     EAX = equal (boolean)

proc COMPARE_BUFFERS first,second,size
     mov ecx,[size]
     jecxz .equal
     mov esi,[first]
     mov edi,[second]
     cld
     rep cmpsb
     jne .different
     .equal: mov eax,TRUE
             ret
     .different: xor eax,eax
                 ret
endp    

These mnemonics automatically increase esi and edi and decrease ecx and execute a comparsion.
They help you to exclude these instructions in you code so CPU doesn't waste time to decode, fetch and execute them (less clocks, more fast)
Code:
inc esi
inc edi
dec ecx
...    
Post 29 Aug 2009, 13:55
View user's profile Send private message Send e-mail Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
Hmm. I benchmarked
Code:
       mov     rax,[variablememory]
        mov     edx,dword[variablememory]
   cmp     rax,[const12345678]
je       @f
  cmp     rax,[constabcdefgh]
je       @f
  shl     rax,8
       mov     cx,dx
       cmp     rax,[consthellooo]
je        @f
  cmp     rax,[constwooorld]
je        @f
  cmp     rax,[constthisisa]
je        @f
  cmp     edx,"test"
je      @f
  shl     edx,8
       cmp     edx,"bye"     shl (8*1)
je @f
  cmp     edx,":)="     shl (8*1)
je @f
  cmp     cx,":)"
je correct
@@:      int3
correct:    
vs
Code:
   mov     rdi,variablememory2
 mov     rsi,const212345678
  mov     rcx,8
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2abcdefgh
  mov     rcx,8
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2hellooo
   mov     rcx,7
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2wooorld
   mov     rcx,7
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2thisisa
   mov     rcx,7
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2test
      mov     rcx,4
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2bye
       mov     rcx,3
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2smilefrown
        mov     rcx,3
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2smile
     mov     rcx,2
       rep     cmpsb
je     correct2
@@:     int3
correct2:    


Data:
Code:
align 64
variablememory dq ":) stuff"
const12345678        dq "12345678"
constabcdefgh        dq "abcdefgh"
consthellooo dq "hellooo"  shl (8*1)
constwooorld       dq "wooorld"  shl (8*1)
constthisisa       dq "thisisa"  shl (8*1)

rd 4096

align 64
variablememory2     db ":) stuff"
const212345678               db "12345678"
const2abcdefgh               db "abcdefgh"
const2hellooo                db "hellooo"
const2wooorld         db "wooorld"
const2thisisa         db "thisisa"
const2test            db "test"
const2bye                db "bye"
const2smilefrown  db ":)="
const2smile               db ":)"    




On E8400, with 1000000 loops, the former is between one and two orders of magnitude faster.. (I should make a better benchmarker that is more consistent, I know, but it's always at least 30-something times faster, so.. good enough for me).




Edit: here's a more optimized version of the cmps one

Code:
      mov     rdi,variablememory2
 mov     rsi,const212345678
  cmpsq
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2abcdefgh
  cmpsq
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2hellooo
   mov     rcx,7
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2wooorld
   mov     rcx,7
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2thisisa
   mov     rcx,7
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2test
      cmpsd
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2bye
       mov     rcx,3
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2smilefrown
        mov     rcx,3
       rep     cmpsb
je     @f
  mov     rdi,variablememory2
 mov     rsi,const2smile
     cmpsw
je     correct2
@@:     int3
correct2:    
The cmp way is still always at least 23 times faster, though.
Post 29 Aug 2009, 14:39
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
Azu wrote:
I just don't see how a rep cmpsb could ever be faster than a cmp Confused
The cmpsb is for variable-sized strings. If you use simple cmp, you can avoid the REP altogether and use a simple cmpsq or cmpsd and that's it.

But you see, with rep, you can be much more flexible and even have 10000-bytes sized strings for example Smile

Azu wrote:
P.S. how do I use your irps thing? I don't understand. It looks like I'd have to place it down where the strings would be, and manually enter each string I want to use into it.. in which case I may as well write the db/dq/whatever myself..
I prefer to use FASM's macros to do it automatically but you can enter it yourself of course.

irps goes through the symbols and replaces the first parameter 'arg' with each symbol, then goes to the next one etc... if you want more than 1 symbol per parameter, use "irp" which goes through a comma-separated list. If you have commas in a parameter, escape it with '\,' so it's part of the parameter, not of irp.

This is basic FASM stuff. I love FASM for its flexibility Cool


EDIT: You did it the wrong way. You only need one piece of code for ALL strings. You loop through them all, not hard-code them in your code...

Here is an example for the variable-sized strings from my program (modified a bit in naming to account for general-purpose example):
Code:
  ; In this example, we start with:
;
;   ebx = points at the string to compare to (this isn't the string list, it is the string which you want to compare and see if it matches)
;   ecx is zeroed except for 'cl' which doesn't matter
;   edi points at the List of Strings you use (i.e the "data structure" I defined previously)

  @@:
    mov cl, [edi]        ; Get the byte before the string representing its length
    test cl, cl          ; See if it's the end of the list
    mov esi, ebx         ; Restore the compared string's offset (so we can compare again)
    jz NoMatch

    inc edi              ; Increase by 1 to the actual argument string (i.e skip the byte length prefix)
    repe cmpsb           ; Compare (this is the small way, for large strings, divide the string into chunks of 32 or 64-bits)
    lea edi, [edi+ecx+4] ; Skip the remaining chars (if any) + the JumpLabel (32-bit address).
    jnz @b

  ; Here we found a match, and 'edi' points to the next string in the String List
  ; so by doing a [edi-4] you get the Jump Label of the string that was matched
  ; and you jump there to do the string-specific code

  ; If the String database is long (has MANY strings), then you can optimize the address even if that adds more
  ; instructions, if you care about SIZE (and caching). For example you can use relative offsets from a given address
  ; since I doubt your program needs 4GB (32-bits), even more so 64-bits addresses. Smile

  ; By the way the above code handles variable-sized strings with a byte-prefix for their size, and an unlimited amount
  ; of strings without writing 1 more line of code!    

_________________
Previously known as The_Grey_Beast


Last edited by Borsuc on 29 Aug 2009, 15:55; edited 1 time in total
Post 29 Aug 2009, 15:46
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
Borsuc wrote:
Azu wrote:
I just don't see how a rep cmpsb could ever be faster than a cmp Confused
The cmpsb is for variable-sized strings. If you use simple cmp, you can avoid the REP altogether and use a simple cmpsq or cmpsd and that's it.

But you see, with rep, you can be much more flexible and even have 10000-bytes sized strings for example Smile
Well I benchmarked them and the cmp method was way faster even for strings of varying lengths, using the shl method from Loco. I don't need 10,000 byte long comparisons. 8 is enough for my case. And I can easily extend it to 16 with SSE.. or 32 when AVX comes out.. and keep in mind this is per compare. If I wanted to do really long strings (e.g. 10000), I could always just use multiple compares. So I really don't see the point of the string instructions.

Borsuc wrote:
Azu wrote:
P.S. how do I use your irps thing? I don't understand. It looks like I'd have to place it down where the strings would be, and manually enter each string I want to use into it.. in which case I may as well write the db/dq/whatever myself..
I prefer to use FASM's macros to do it automatically but you can enter it yourself of course.

irps goes through the symbols and replaces the first parameter 'arg' with each symbol, then goes to the next one etc... if you want more than 1 symbol per parameter, use "irp" which goes through a comma-separated list. If you have commas in a parameter, escape it with '\,' so it's part of the parameter, not of irp.

This is basic FASM stuff. I love FASM for its flexibility Cool
What I mean is, don't I have to manually put all of them into it? Like every time I use a constant, I'd still have to scroll all the way down there, and write that constant into it myself, wouldn't I? So what work is it saving? Confused


Last edited by Azu on 29 Aug 2009, 15:56; edited 1 time in total
Post 29 Aug 2009, 15:52
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
See my edit, I did it while you posted this. It is much more flexible method and is used for general-purpose variable-length strings, not hard-coded in your code.

You CAN make it NON-variable-length and remove the "rep" completely, AND the byte-prefix too, if you don't need it!


Last edited by Borsuc on 29 Aug 2009, 15:58; edited 1 time in total
Post 29 Aug 2009, 15:55
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
I'm 100% sure that one will be slower than all 3 of the ones I listed. There's just no way it could be faster.. benchmark them yourself if you want.
Post 29 Aug 2009, 15:57
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
Yes but it does use far less caching if you have a large amount of strings (= smaller). It also works with long strings (up to 255 bytes in size) Smile
Post 29 Aug 2009, 15:59
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
Nope :(

first run wrote:
01 loops
13617 clocks empty32bitFunc
216 clocks empty64bitFunc
13257 clocks cmps way (64bit)
693 clocks cmp way (64bit)
1737 clocks Borsuc's way (64bit)

02 loops
99 clocks empty32bitFunc
176 clocks empty64bitFunc
742 clocks cmps way (64bit)
238 clocks cmp way (64bit)
747 clocks Borsuc's way (64bit)

01 loops
72 clocks empty32bitFunc
198 clocks empty64bitFunc
522 clocks cmps way (64bit)
333 clocks cmp way (64bit)
963 clocks Borsuc's way (64bit)

02 loops
40 clocks empty32bitFunc
194 clocks empty64bitFunc
486 clocks cmps way (64bit)
234 clocks cmp way (64bit)
760 clocks Borsuc's way (64bit)

05 loops
13 clocks empty32bitFunc
160 clocks empty64bitFunc
459 clocks cmps way (64bit)
225 clocks cmp way (64bit)
706 clocks Borsuc's way (64bit)

1000 loops
21 clocks empty32bitFunc
173 clocks empty64bitFunc
489 clocks cmps way (64bit)
187 clocks cmp way (64bit)
738 clocks Borsuc's way (64bit)

100000 loops
0 clocks empty32bitFunc
152 clocks empty64bitFunc
446 clocks cmps way (64bit)
168 clocks cmp way (64bit)
690 clocks Borsuc's way (64bit)

1000000 loops
0 clocks empty32bitFunc
153 clocks empty64bitFunc
443 clocks cmps way (64bit)
169 clocks cmp way (64bit)
708 clocks Borsuc's way (64bit)

1000000 loops
181ms empty32bitFunc
222ms empty64bitFunc
305ms cmps way (64bit)
226ms cmp way (64bit)
371ms Borsuc's way (64bit)
second run wrote:
01 loops
6633 clocks empty32bitFunc
306 clocks empty64bitFunc
18324 clocks cmps way (64bit)
738 clocks cmp way (64bit)
1584 clocks Borsuc's way (64bit)

02 loops
40 clocks empty32bitFunc
270 clocks empty64bitFunc
482 clocks cmps way (64bit)
238 clocks cmp way (64bit)
738 clocks Borsuc's way (64bit)

01 loops
54 clocks empty32bitFunc
360 clocks empty64bitFunc
513 clocks cmps way (64bit)
351 clocks cmp way (64bit)
945 clocks Borsuc's way (64bit)

02 loops
36 clocks empty32bitFunc
194 clocks empty64bitFunc
477 clocks cmps way (64bit)
230 clocks cmp way (64bit)
747 clocks Borsuc's way (64bit)

05 loops
13 clocks empty32bitFunc
162 clocks empty64bitFunc
508 clocks cmps way (64bit)
227 clocks cmp way (64bit)
706 clocks Borsuc's way (64bit)

1000 loops
0 clocks empty32bitFunc
152 clocks empty64bitFunc
440 clocks cmps way (64bit)
167 clocks cmp way (64bit)
689 clocks Borsuc's way (64bit)

100000 loops
0 clocks empty32bitFunc
157 clocks empty64bitFunc
442 clocks cmps way (64bit)
180 clocks cmp way (64bit)
695 clocks Borsuc's way (64bit)

1000000 loops
0 clocks empty32bitFunc
154 clocks empty64bitFunc
441 clocks cmps way (64bit)
170 clocks cmp way (64bit)
707 clocks Borsuc's way (64bit)

1000000 loops
181ms empty32bitFunc
223ms empty64bitFunc
304ms cmps way (64bit)
226ms cmp way (64bit)
372ms Borsuc's way (64bit)



Here is the code I used for your way..
Code:
mov        rbx,variablememory3
mov      rdi,Strings
xor      rcx,rcx
@@:      mov     cl,[rdi]        ;Get the byte before the string representing its length
     test    cl,cl           ;See if it's the end of the list
   mov     rsi,rbx         ;Restore the compared string's offset (so we can compare again)
jz  @f
  inc     rdi             ;Increase by 1 to the actual argument string (i.e skip the byte length prefix)
      repe    cmpsb           ;Compare (this is the small way, for large strings, divide the string into chunks of 32 or 64-bits)
 mov     rax,[rdi+rcx]   ;the JumpLabel (64-bit address).
    lea     rdi,[rdi+rcx+8] ;Skip the remaining chars (if any) + the JumpLabel (64-bit address).
jne     @b
jmp       rax
@@:
wrong:    int3
correct3:    

and in data
Code:
align 64
variablememory3          db ":) stuff"
Strings:         db 8
                        db "12345678"
                     dq wrong
                    db 8
                        db "abcdefgh"
                     dq wrong
                    db 7
                        db "hellooo"
                      dq wrong
                    db 7
                        db "wooorld"
                      dq wrong
                    db 7
                        db "thisisa"
                      dq wrong
                    db 4
                        db "test"
                 dq wrong
                    db 3
                        db "bye"
                  dq wrong
                    db 3
                        db ":)="
                  dq wrong
                    db 2
                        db ":)"
                   dq correct3
db 0    
Post 29 Aug 2009, 16:36
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
Well may not be faster, but it sure is smaller for many strings, also more convenient/easier to expand (just add more strings in the data, that's all -- much easier with "irp" but I guess I'm the only one here who loves FASM's powerful capabilities Razz). (also the address doesn't need to be 64-bit, it can simply be a relative 32-bit value which you add a base afterwards (after the loop so speed impact is negligible)).

Did you use variable-length strings in the "cmp" way?
Because if not, of course mine is at a disadvantage! It handles arbitrarily sized strings. (255 max)

If all strings have same length you can get rid of the prefix byte and compare them without a 'rep', with a simple cmpsq or something like that.
Post 29 Aug 2009, 17:37
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
???
I posted the code for all 3 ways.. just look for yourself..
Post 30 Aug 2009, 04:08
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
oh that's what you mean by "cmps" way, sorry I misunderstood Laughing

unless you're in a very tight loop in which you need absolute speed, I still prefer my way (or the "cmps" way) because it's:

1) smaller for many strings (the most important, for me Very Happy), especially if you optimize the data structure (i.e use perhaps 3 bytes for the "jump offset", as a relative offset from a given value)
2) easier to expand
3) more elegant. (because FASM makes the data structure automatically in my case, with 'irp'/'irps' Cool).
Post 30 Aug 2009, 14:27
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
Meh.. I think cmp way is just so much faster that it's worth it.. and I read somewhere that in Nehalem, a cmp and je can be done in a single uop, and up to 5 of them in a single clock.. with the 64bit compares..
There's just no way to get close to that with the string opcodes lol.
Post 30 Aug 2009, 14:34
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
Who says loops aren't optimized? Usually they are unrolled by the CPU also.

Though, what's so much faster? It's just 3-4 times faster. For many strings it's not even 2 times faster.
Post 30 Aug 2009, 14:54
View user's profile Send private message Reply with quote
Azu



Joined: 16 Dec 2008
Posts: 1160
Azu
You left out a 0.. and I wasn't even testing on a nehalem, anyways..
Post 30 Aug 2009, 14:57
View user's profile Send private message Send e-mail AIM Address Yahoo Messenger MSN Messenger ICQ Number Reply with quote
Borsuc



Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
Borsuc
I don't know how to read your stats on speed measurement, for me this doesn't look like it has an extra 0:
Code:
99 clocks empty32bitFunc
176 clocks empty64bitFunc
742 clocks cmps way (64bit)
238 clocks cmp way (64bit)
747 clocks Borsuc's way (64bit)     
Post 30 Aug 2009, 14:59
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.