flat assembler
Message board for the users of flat assembler.
Index
> Main > Clear top byte of 64bit GPR? Goto page Previous 1, 2, 3 Next |
Author |
|
Azu 28 Aug 2009, 15:45
Thanks, but isn't
cmp reg,[const] faster than mov ecx,i mov edi,memory mov esi,const rep cmpsb ?? |
|||
28 Aug 2009, 15:45 |
|
Azu 28 Aug 2009, 15:55
The rep won't work to do all the comparisons in it.. I would need one of them for each comparison.. wouldn't I?
|
|||
28 Aug 2009, 15:55 |
|
Borsuc 28 Aug 2009, 20:37
I don't get your question. You mean you won't compare all of the strings with rep?
Of course not, if that's what you're saying. What you're doing is compare the string with each string from the data structure. It's not like the other case doesn't do the same. (i.e you have multiple 'cmp' not just one!). so basically, if you don't consider the "rep cmpsb/cmpsd/whatever" a loop, then you have only one loop (through all the strings in the database). |
|||
28 Aug 2009, 20:37 |
|
Azu 29 Aug 2009, 05:11
I just don't see how a rep cmpsb could ever be faster than a cmp
I guess the only way to find out is to benchmark it though. P.S. how do I use your irps thing? I don't understand. It looks like I'd have to place it down where the strings would be, and manually enter each string I want to use into it.. in which case I may as well write the db/dq/whatever myself.. |
|||
29 Aug 2009, 05:11 |
|
Pirata Derek 29 Aug 2009, 13:55
The rep cmpsb permits you to execute a multiple serial compasion.
If you have to compare a huge amount of data, like 2 buffers comparsion, you should use these mnemonics. Code: ; Compare buffer function ; INPUT: ; PUSHD first buffer (offset) ; PUSHD second buffer (offset) ; PUSHD bytes to compare (dword) ; OUTPUT: ; EAX = equal (boolean) proc COMPARE_BUFFERS first,second,size mov ecx,[size] jecxz .equal mov esi,[first] mov edi,[second] cld rep cmpsb jne .different .equal: mov eax,TRUE ret .different: xor eax,eax ret endp These mnemonics automatically increase esi and edi and decrease ecx and execute a comparsion. They help you to exclude these instructions in you code so CPU doesn't waste time to decode, fetch and execute them (less clocks, more fast) Code: inc esi inc edi dec ecx ... |
|||
29 Aug 2009, 13:55 |
|
Azu 29 Aug 2009, 14:39
Hmm. I benchmarked
Code: mov rax,[variablememory] mov edx,dword[variablememory] cmp rax,[const12345678] je @f cmp rax,[constabcdefgh] je @f shl rax,8 mov cx,dx cmp rax,[consthellooo] je @f cmp rax,[constwooorld] je @f cmp rax,[constthisisa] je @f cmp edx,"test" je @f shl edx,8 cmp edx,"bye" shl (8*1) je @f cmp edx,":)=" shl (8*1) je @f cmp cx,":)" je correct @@: int3 correct: Code: mov rdi,variablememory2 mov rsi,const212345678 mov rcx,8 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2abcdefgh mov rcx,8 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2hellooo mov rcx,7 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2wooorld mov rcx,7 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2thisisa mov rcx,7 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2test mov rcx,4 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2bye mov rcx,3 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2smilefrown mov rcx,3 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2smile mov rcx,2 rep cmpsb je correct2 @@: int3 correct2: Data: Code: align 64 variablememory dq ":) stuff" const12345678 dq "12345678" constabcdefgh dq "abcdefgh" consthellooo dq "hellooo" shl (8*1) constwooorld dq "wooorld" shl (8*1) constthisisa dq "thisisa" shl (8*1) rd 4096 align 64 variablememory2 db ":) stuff" const212345678 db "12345678" const2abcdefgh db "abcdefgh" const2hellooo db "hellooo" const2wooorld db "wooorld" const2thisisa db "thisisa" const2test db "test" const2bye db "bye" const2smilefrown db ":)=" const2smile db ":)" On E8400, with 1000000 loops, the former is between one and two orders of magnitude faster.. (I should make a better benchmarker that is more consistent, I know, but it's always at least 30-something times faster, so.. good enough for me). Edit: here's a more optimized version of the cmps one Code: mov rdi,variablememory2 mov rsi,const212345678 cmpsq je @f mov rdi,variablememory2 mov rsi,const2abcdefgh cmpsq je @f mov rdi,variablememory2 mov rsi,const2hellooo mov rcx,7 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2wooorld mov rcx,7 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2thisisa mov rcx,7 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2test cmpsd je @f mov rdi,variablememory2 mov rsi,const2bye mov rcx,3 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2smilefrown mov rcx,3 rep cmpsb je @f mov rdi,variablememory2 mov rsi,const2smile cmpsw je correct2 @@: int3 correct2: |
|||
29 Aug 2009, 14:39 |
|
Borsuc 29 Aug 2009, 15:46
Azu wrote: I just don't see how a rep cmpsb could ever be faster than a cmp But you see, with rep, you can be much more flexible and even have 10000-bytes sized strings for example Azu wrote: P.S. how do I use your irps thing? I don't understand. It looks like I'd have to place it down where the strings would be, and manually enter each string I want to use into it.. in which case I may as well write the db/dq/whatever myself.. irps goes through the symbols and replaces the first parameter 'arg' with each symbol, then goes to the next one etc... if you want more than 1 symbol per parameter, use "irp" which goes through a comma-separated list. If you have commas in a parameter, escape it with '\,' so it's part of the parameter, not of irp. This is basic FASM stuff. I love FASM for its flexibility EDIT: You did it the wrong way. You only need one piece of code for ALL strings. You loop through them all, not hard-code them in your code... Here is an example for the variable-sized strings from my program (modified a bit in naming to account for general-purpose example): Code: ; In this example, we start with: ; ; ebx = points at the string to compare to (this isn't the string list, it is the string which you want to compare and see if it matches) ; ecx is zeroed except for 'cl' which doesn't matter ; edi points at the List of Strings you use (i.e the "data structure" I defined previously) @@: mov cl, [edi] ; Get the byte before the string representing its length test cl, cl ; See if it's the end of the list mov esi, ebx ; Restore the compared string's offset (so we can compare again) jz NoMatch inc edi ; Increase by 1 to the actual argument string (i.e skip the byte length prefix) repe cmpsb ; Compare (this is the small way, for large strings, divide the string into chunks of 32 or 64-bits) lea edi, [edi+ecx+4] ; Skip the remaining chars (if any) + the JumpLabel (32-bit address). jnz @b ; Here we found a match, and 'edi' points to the next string in the String List ; so by doing a [edi-4] you get the Jump Label of the string that was matched ; and you jump there to do the string-specific code ; If the String database is long (has MANY strings), then you can optimize the address even if that adds more ; instructions, if you care about SIZE (and caching). For example you can use relative offsets from a given address ; since I doubt your program needs 4GB (32-bits), even more so 64-bits addresses. ; By the way the above code handles variable-sized strings with a byte-prefix for their size, and an unlimited amount ; of strings without writing 1 more line of code! _________________ Previously known as The_Grey_Beast Last edited by Borsuc on 29 Aug 2009, 15:55; edited 1 time in total |
|||
29 Aug 2009, 15:46 |
|
Azu 29 Aug 2009, 15:52
Borsuc wrote:
Borsuc wrote:
Last edited by Azu on 29 Aug 2009, 15:56; edited 1 time in total |
|||
29 Aug 2009, 15:52 |
|
Borsuc 29 Aug 2009, 15:55
See my edit, I did it while you posted this. It is much more flexible method and is used for general-purpose variable-length strings, not hard-coded in your code.
You CAN make it NON-variable-length and remove the "rep" completely, AND the byte-prefix too, if you don't need it! Last edited by Borsuc on 29 Aug 2009, 15:58; edited 1 time in total |
|||
29 Aug 2009, 15:55 |
|
Azu 29 Aug 2009, 15:57
I'm 100% sure that one will be slower than all 3 of the ones I listed. There's just no way it could be faster.. benchmark them yourself if you want.
|
|||
29 Aug 2009, 15:57 |
|
Borsuc 29 Aug 2009, 15:59
Yes but it does use far less caching if you have a large amount of strings (= smaller). It also works with long strings (up to 255 bytes in size)
|
|||
29 Aug 2009, 15:59 |
|
Azu 29 Aug 2009, 16:36
Nope :(
first run wrote: 01 loops second run wrote: 01 loops Here is the code I used for your way.. Code: mov rbx,variablememory3 mov rdi,Strings xor rcx,rcx @@: mov cl,[rdi] ;Get the byte before the string representing its length test cl,cl ;See if it's the end of the list mov rsi,rbx ;Restore the compared string's offset (so we can compare again) jz @f inc rdi ;Increase by 1 to the actual argument string (i.e skip the byte length prefix) repe cmpsb ;Compare (this is the small way, for large strings, divide the string into chunks of 32 or 64-bits) mov rax,[rdi+rcx] ;the JumpLabel (64-bit address). lea rdi,[rdi+rcx+8] ;Skip the remaining chars (if any) + the JumpLabel (64-bit address). jne @b jmp rax @@: wrong: int3 correct3: and in data Code: align 64 variablememory3 db ":) stuff" Strings: db 8 db "12345678" dq wrong db 8 db "abcdefgh" dq wrong db 7 db "hellooo" dq wrong db 7 db "wooorld" dq wrong db 7 db "thisisa" dq wrong db 4 db "test" dq wrong db 3 db "bye" dq wrong db 3 db ":)=" dq wrong db 2 db ":)" dq correct3 db 0 |
|||
29 Aug 2009, 16:36 |
|
Borsuc 29 Aug 2009, 17:37
Well may not be faster, but it sure is smaller for many strings, also more convenient/easier to expand (just add more strings in the data, that's all -- much easier with "irp" but I guess I'm the only one here who loves FASM's powerful capabilities ). (also the address doesn't need to be 64-bit, it can simply be a relative 32-bit value which you add a base afterwards (after the loop so speed impact is negligible)).
Did you use variable-length strings in the "cmp" way? Because if not, of course mine is at a disadvantage! It handles arbitrarily sized strings. (255 max) If all strings have same length you can get rid of the prefix byte and compare them without a 'rep', with a simple cmpsq or something like that. |
|||
29 Aug 2009, 17:37 |
|
Azu 30 Aug 2009, 04:08
???
I posted the code for all 3 ways.. just look for yourself.. |
|||
30 Aug 2009, 04:08 |
|
Borsuc 30 Aug 2009, 14:27
oh that's what you mean by "cmps" way, sorry I misunderstood
unless you're in a very tight loop in which you need absolute speed, I still prefer my way (or the "cmps" way) because it's: 1) smaller for many strings (the most important, for me ), especially if you optimize the data structure (i.e use perhaps 3 bytes for the "jump offset", as a relative offset from a given value) 2) easier to expand 3) more elegant. (because FASM makes the data structure automatically in my case, with 'irp'/'irps' ). |
|||
30 Aug 2009, 14:27 |
|
Azu 30 Aug 2009, 14:34
Meh.. I think cmp way is just so much faster that it's worth it.. and I read somewhere that in Nehalem, a cmp and je can be done in a single uop, and up to 5 of them in a single clock.. with the 64bit compares..
There's just no way to get close to that with the string opcodes lol. |
|||
30 Aug 2009, 14:34 |
|
Borsuc 30 Aug 2009, 14:54
Who says loops aren't optimized? Usually they are unrolled by the CPU also.
Though, what's so much faster? It's just 3-4 times faster. For many strings it's not even 2 times faster. |
|||
30 Aug 2009, 14:54 |
|
Azu 30 Aug 2009, 14:57
You left out a 0.. and I wasn't even testing on a nehalem, anyways..
|
|||
30 Aug 2009, 14:57 |
|
Borsuc 30 Aug 2009, 14:59
I don't know how to read your stats on speed measurement, for me this doesn't look like it has an extra 0:
Code: 99 clocks empty32bitFunc 176 clocks empty64bitFunc 742 clocks cmps way (64bit) 238 clocks cmp way (64bit) 747 clocks Borsuc's way (64bit) |
|||
30 Aug 2009, 14:59 |
|
Goto page Previous 1, 2, 3 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.