flat assembler
Message board for the users of flat assembler.
Index
> Main > Fast Way to Get String Length (StrLen) Goto page 1, 2 Next |
Author |
|
revolution 18 Nov 2004, 08:33
You need to optimize the instructions you use to take full advantage ot the later processors like P4.
INC --> ADD OR --> TEST And remove the difficult to predict branches. Agner Fogg has published a paper about this topic. Basically he suggests this: Code: ;edx=string start lea ecx,[edx+4] ;load and increment pointer mov ebx,[edx] ;read first 4 bytes lea edx,[edx+7] ;pointer+7 used in the end ._1: lea eax,[ebx-01010101h] ;subtract 1 from each byte xor ebx,-1 ;invert all bytes and eax,ebx ;and these two mov ebx,[ecx] ;read next 4 bytes add ecx,4 ;increment pointer and eax,80808080h ;test all sign bits jz ._1 ;no zero bytes, continue loop test eax,00008080h ;test first two bytes jnz ._2 shr eax,16 ;not in the first 2 bytes add ecx,2 ._2: shl al,1 ;use carry flag to avoid a branch sbb ecx,edx ;compute length lea edx,[edx-7] ;restore pointer |
|||
18 Nov 2004, 08:33 |
|
Madis731 18 Nov 2004, 13:52
You really ARE a revolution . A man worthy of his name.
Tight and fast - gotta luv those lea's. |
|||
18 Nov 2004, 13:52 |
|
Matrix 19 Nov 2004, 03:17
hy revolution,
its cool |
|||
19 Nov 2004, 03:17 |
|
Madis731 19 Nov 2004, 15:28
I re-read the title (FAST) so short C-compiler code doesn't help much
Code: mov edx,ebx cmp byte ptr [ebx],0 je l2 l1: mov ah,[ebx+1] ; U inc ebx ; V test ah,ah ; U jne l1 ; V +1brt l2: sub ebx,edx ;Pre PIII optimized I think so I found this DWORD scanning algorithm: Code: ; by Paul Hsieh lea ecx,[ebx-1] l1: inc ecx test ecx,3 jz l2 cmp byte ptr [ecx],0 jne l1 jmp l6 l2: mov eax,[ecx] ; U add ecx,4 ; V test al,al ; U jz l5 ; V test ah,ah ; U jz l4 ; V test eax,0ff0000h ; U jz l3 ; V test eax,0ff000000h ; U jnz l2 ; V +1brt inc ecx l3: inc ecx l4: inc ecx l5: sub ecx,4 l6: sub ecx,ebx OK, I won't post all the code here Go and check out http://www.azillionmonkeys.com/qed/asmexample.html for more code (it under 5.) |
|||
19 Nov 2004, 15:28 |
|
Matrix 19 Nov 2004, 16:51
hy Madis731
your code seems familiar to me from somewhere (its nearly the same i have posted) |
|||
19 Nov 2004, 16:51 |
|
Madis731 19 Nov 2004, 19:15
Oh, sorry, I can see sarcasm in your post :$
Now I see you indeed have posted the same code - I didn't look it that closely before. How come it's the same solution with different registers ^o) even with different types (WORD/DWORD) of register? |
|||
19 Nov 2004, 19:15 |
|
Matrix 19 Nov 2004, 20:47
well exactly i wrote that function i posted,
i wrote it just before i posted it and its in my style, they are just familiar, same structure, but i see the system in my codes, in other's codes there used to be difference - however i'm on doing some class codes too. my codes usually are what i design them to be, depennding the purpose of code, i guess some of you are doing familiar. in some cases two codes can be the same without sarcasm. especially when the code is short ( not a 10 MB source ) |
|||
19 Nov 2004, 20:47 |
|
Matrix 21 Nov 2004, 11:41
hy all,
hy revolution you avoided a branch, (however it didn't help much) and i have removed some unneccessary instructions (StrLen32M) and after dword aligning it could be faster: (StrLen32Z) Code: macro bdisplay .__stringvar002 { mov si,.__stringvar002 mov bl,10 call bwritestring } macro dw_strlength .__stringvar001 {dw ($-.__stringvar001-1) and $ffff} org 256 mov si,msg call StrLen32R ; input: esi=offset to string ; output: ecx=length (w/o 0) esi=pointer to end of string ; destroys: eflags push cx bdisplay lengthtext pop cx mov ax,cx call bwriteint call bendline mov edx,msg call StrLen32 ; input: esi=offset to string ; output: ecx=length (w/o 0) esi=pointer to end of string ; destroys: eflags push cx bdisplay lengthtext pop cx mov ax,cx call bwriteint call bendline mov esi,msg call StrLen32M ; input: esi=offset to string ; output: ecx=length (w/o 0) esi=pointer to end of string ; destroys: eflags push cx bdisplay lengthtext pop cx mov ax,cx call bwriteint call bendline mov si,msg call StrLen32Z ; input: esi=offset to string ; output: ecx=length (w/o 0) esi=pointer to end of string ; destroys: eflags push cx bdisplay lengthtext pop cx mov ax,cx call bwriteint call bendline xor ah,ah int 16h ret msg: db 'this is a test string.',0 ;msg: db '123',0 dw_strlength msg ;length_msg dw endmsg-msg-1 lengthtext: db 'Test string length:',0 StrLen32R: ; input: esi=offset to string ; output: ecx=length (w/o 0) esi=pointer to end of string ; destroys: eflags mov cx,si .byte: test si,11b jz .dword lodsb or al,al jnz .byte jmp .done .dword: lodsd or al,al jz .lastdd or ah,ah jz .thisdd test eax,0ff0000h jz .thirdd test eax,0ff000000h jnz .dword inc si .thirdd:inc si .thisdd:inc si .lastdd:sub si,3 .done: dec si sub cx,si neg cx ret StrLen32: ;edx=string start, returns: ecx=length lea ecx,[edx+4] ;load and increment pointer mov ebx,[edx] ;read first 4 bytes lea edx,[edx+7] ;pointer+7 used in the end ._1: lea eax,[ebx-01010101h] ;subtract 1 from each byte xor ebx,-1 ;invert all bytes and eax,ebx ;and these two mov ebx,[ecx] ;read next 4 bytes add ecx,4 ;increment pointer and eax,80808080h ;test all sign bits jz ._1 ;no zero bytes, continue loop test eax,00008080h ;test first two bytes jnz ._2 shr eax,16 ;not in the first 2 bytes add ecx,2 ._2: shl al,1 ;use carry flag to avoid a branch sbb ecx,edx ;compute length lea edx,[edx-7] ;restore pointer ret StrLen32M: ;esi=string start, returns: ecx=length mov ecx,esi ._1:lodsd lea ebx,[eax-$01010101] xor eax,-1 and ebx,eax and ebx,$80808080 jz ._1 add ecx,4 .lp:shr ebx,8 jc .sb dec ecx jnz .lp .sb:neg ecx add ecx,esi ret StrLen32Z: ; input: esi=offset to string ; output: ecx=length (w/o 0) esi=pointer to end of string ; destroys: eflags mov eax,esi and eax,11b jz .dword mov ecx,4 sub ecx,eax .byte: repnz scasb or cx,cx jz .dword mov ecx,esi jmp .sb .dword: mov ecx,esi .loop: lodsd lea ebx,[eax-$01010101] xor eax,-1 and ebx,eax and ebx,$80808080 jz .loop add ecx,4 .lp:shr ebx,8 jc .sb dec ecx jnz .lp .sb:neg ecx add ecx,esi ret bwriteint: ; AX = signed number, BL = base or ax,ax jns .printit ; if it's not negative, we don't print any signs push ax mov ax,$e2d int 10h pop ax neg ax ; do the number positive .printit: push ax bx cx dx and bx,$ff cmp bl,2 ; base can't be less than 2 jge .start mov bl,10 ; using bx = 10 instead .start: xor cx,cx ; cx = 0 .new: xor dx,dx ; dx = 0 div bx ; number / base push dx ; push the remainder inc cx ; increase the "digit-count" or ax,ax ; if the quotient still is not 0, do it once more jnz .new .loop: pop ax ; pop the remainder cmp al,10 sbb al,69h das mov ah,$e int 10h loop .loop pop dx cx bx ax ret bendline: ; ends the line ( adds y loc = 1 x loc = 0 ) push ax bx ; returns: AX = 0xE0A ; BX = 7 mov bx,7 mov ax,$E0D int 10h mov al,$A int 10h pop bx ax ret bwritestring: ; Writes a 0 terminated string to screen ( string is at ds:si ) push ax bx mov bx,7 ; use video page 0, normal white mov ah,$e localloop: lodsb or al,al jnz next_char pop bx ax ret next_char: int 10h jmp localloop ps.: its a good feeling to know what am i doing Madis731 i like LEA too, but i don't use it when i don't need it |
|||
21 Nov 2004, 11:41 |
|
Octavio 21 Nov 2004, 17:10
This is not the kind of things that needs optimization,but just for fun here is my version, the main loop is similar to the one used by Agner fog ,but
the pointer is aligned ,and from my tests if the code is in the adecuate position it can process 4 bytes in 3 cpu clocks. use32 strlen: ;uses the 'C' calling convention the pointer is pushed on the stack ;registers are saved and result is returned in eax push ecx push esi mov esi,[esp+4+8] mov eax,[esi] and esi,-4 .l1: add esi,4 lea ecx,[eax-1010101h] not eax and ecx,eax mov eax,[esi] and ecx,80808080h jz .l1 bsf eax,ecx sub esi,[esp+4+8] sar eax,3 lea eax,[eax+esi-4] pop esi pop ecx ret ;replace by 'ret 4' to use the Pascal calling convention |
|||
21 Nov 2004, 17:10 |
|
roticv 21 Nov 2004, 17:20
Matrix what kind of strings are you interested in? Long strings?
|
|||
21 Nov 2004, 17:20 |
|
Matrix 21 Nov 2004, 17:39
thnx for all for ideas and everything,
for me this thread is complete with my StrLen32Z version, of course if someone wanna continue optimizing, feel free to do so. looks like MMX is next step, but for a 0 terminated string? well i don't see any use of it yet. |
|||
21 Nov 2004, 17:39 |
|
Madis731 21 Nov 2004, 17:57
Didn't you look under the link I gave.
It already has an MMX solution! Oh, I guess I have to post it anyways. You people are so lazy Code: ; MMX version by Ryan Mack ; Roughly 13 + 3n + BRANCH clocks on a P-II const unsigned __int64 STRINGTBL[8] = {0, 0xff, 0xffff, 0xffffff, 0xffffffff, 0xffffffffff, 0xffffffffffff, 0xffffffffffffff} /* ... */ pxor mm1, mm1 mov ecx, eax mov edx, eax and ecx, -8 and eax, 7 movq mm0, [ecx] por mm0, [STRINGTBL+eax*8] MAIN: add ecx, 8 pcmpeqb mm0, mm1 packsswb mm0, mm0 movd eax, mm0 movq mm0, [ecx] test eax, eax jz MAIN bsf eax, eax shr eax, 2 lea ecx, [ecx+eax-8] sub ecx, edx |
|||
21 Nov 2004, 17:57 |
|
Matrix 21 Nov 2004, 18:01
Madis731 wrote: Didn't you look under the link I gave. you're so right |
|||
21 Nov 2004, 18:01 |
|
beppe85 03 Feb 2005, 12:24
Care should be taken in the MMX version if the string spans a page(8-byte access may cause a page fault).
Not that I like this kind of strings(they're overall slow per nature), but will be a big win "align 8" string allocations. |
|||
03 Feb 2005, 12:24 |
|
S.T.A.S. 04 Feb 2005, 10:11
Nope, MMX data may be located at any address, the alignment must be done for data accesed by some SSE instruction like MOVAPS.
However, data shouldn't straddle a cache-line boundary because this gives speed penalty. |
|||
04 Feb 2005, 10:11 |
|
beppe85 04 Feb 2005, 13:36
S.T.A.S., I'm refering on general memory access. What about if four bytes fall into your valid page, and next 4 into an unmapped page? AV, eh?
|
|||
04 Feb 2005, 13:36 |
|
Madis731 04 Feb 2005, 14:57
This is a rare occasion but might happen - like when you reserve 1007 Bytes in the end of a page and pass 126QW to your MMX routine. I haven't tryed but maybe MMX fills unknown values with zeroes or random
|
|||
04 Feb 2005, 14:57 |
|
beppe85 04 Feb 2005, 18:41
The solution is so simple, align your C-strings on qword boundary, or better, forget C(reep)-strings.
|
|||
04 Feb 2005, 18:41 |
|
beppe85 04 Feb 2005, 20:05
Madis731 wrote: I haven't tryed but maybe MMX fills unknown values with zeroes or random That would be silly. MMX gives the same feedback. Code: invoke VirtualAlloc, 0, 8192, MEM_RESERVE, PAGE_NOACCESS invoke VirtualAlloc, eax, 4096, MEM_COMMIT, PAGE_READWRITE add eax, 4096-4 movq mm0, [eax] _________________ "I assemble, therefore I am" If you got some spare time, visit my blog: http://www.beppe.theblog.com.br/ and sign my guestmap |
|||
04 Feb 2005, 20:05 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.