flat assembler
Message board for the users of flat assembler.

Index > Main > fast strlen

Goto page 1, 2, 3  Next
Author
Thread Post new topic Reply to topic
lovefasm



Joined: 17 Jun 2007
Posts: 22
lovefasm
;注:高效StrLen函数汇编代码实现
;经过在 Visual C++ 2005 测试,速度远超越了其所带C库的 strlen 函数,所用时间比 8/13
;未检测参数有效性,建议用SEH __try/__except(XXX)检测

;编译工具 fasm 1.67.23
;完成时间 2007.10.12 晚

format ms coff

public StrLen as '_StrLen@4'

;int __stdcall StrLen(const char* str) ;C调用原形

StrLen:
.string equ dword [esp+4]

mov ecx,.string ; ecx = .string
test ecx,3 ; 测试string 首地址是否4字节对齐(Visual C++ 默认生成的其实都是4字节对齐)
je .main_loop

.str_misaligned:
; 拷贝对齐前的1~3个字节
mov al,[ecx]
inc ecx
test al,al
je .byte_3
test ecx,3
jne .str_misaligned ; 对齐

.main_loop:
mov eax,[ecx] ; 读取 4 bytes
add ecx,4

test al,al ; is it byte 0
je .byte_0 ; 多计算了4 byte

test ah,ah ; is it byte 1
je .byte_1 ; 多计算了3 byte

shr eax,16

test al,al ; is it byte 2
je .byte_2 ; 多计算了2 byte

test ah,ah ; is it byte 3
je .byte_3 ; 多计算了1 byte

jmp .main_loop ; 准备读取下一个 4 bytes

;减去多计算的字节数
.byte_3:
dec ecx
jmp .result
.byte_2:
sub ecx,2
jmp .result
.byte_1:
sub ecx,3
jmp .result
.byte_0:
sub ecx,4

.result:
mov eax,ecx
sub eax,.string
ret 4


Description:
Download
Filename: FastStrLen.rar
Filesize: 1016 Bytes
Downloaded: 303 Time(s)

Post 12 Oct 2007, 15:09
View user's profile Send private message Reply with quote
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid
not as fast though... there are some ways to test entire dword with single comparing, after doing som bit magic with the dword
Post 12 Oct 2007, 15:23
View user's profile Send private message Visit poster's website AIM Address MSN Messenger ICQ Number Reply with quote
Mac2004



Joined: 15 Dec 2003
Posts: 313
Mac2004
lovefasm: Could you please use code tags? They improve readability of your code quite bit.

regards,
Mac2004
Post 12 Oct 2007, 18:03
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Yea this form has a ton of strlen snippets if you do a search.

Generic library type versions will never be as fast as versions specifically made for the program.

IE: If I have a buffer that is holding ASCIIZ bytes
- I know I created the buffer using VirtualAlloc so its 16byte aligned (at least)
- I know I created it with at least 32 bytes of extra padding at the end.
SO, I'll use an SSE string len that checks 32 bytes at a time (every iteration).
Code:
align 16
strlen:  
;;string len with aligned and padded buffer 
;;[esp+4] string pointer 
        mov     eax,[esp+4] 
        push    ebx 
        test    eax,eax 
        jz      .fail 
        pxor    xmm0,xmm0 
        mov     ebx,eax 
;AMDPad16 ;;NOP sequence 16 byte align padding macro
.len: 
;;load 32 bytes of the string 
        movdqa  xmm2,[eax+16] 
        movdqa  xmm1,[eax] 
;;search for a null character 
        pcmpeqb xmm2,xmm0 
        pcmpeqb xmm1,xmm0 
;;move results to 32bit registers 
        pmovmskb edx,xmm2 
        pmovmskb ecx,xmm1 
;;combine the bit flagged results and increment the loop 
        shl     edx,16 
        add     eax,32 
        or      ecx,edx 
        jz      .len 
;;calculate the length (end-start + byte_offset - 32) 
        sub     eax,ebx 
        bsf     ecx,ecx 
        lea     eax,[eax+ecx-32] 
        pop     ebx 
        ret     4 
.fail: 
        xor     eax,eax 
        pop     ebx 
        ret     4 
    

This is probably close to the fastest way to get a strlen, but again it's only valid if the -'ed conditions are met.
Post 12 Oct 2007, 23:21
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
SomeoneNew



Joined: 12 Aug 2006
Posts: 54
SomeoneNew
Either that or storing the string-length after length-changing operations Smile

_________________
Im new, sorry if I bothered with any stupid question Smile
Post 29 Dec 2007, 14:45
View user's profile Send private message Reply with quote
Vasilev Vjacheslav



Joined: 11 Aug 2004
Posts: 392
Vasilev Vjacheslav
what about Agner Fog strlen?
Post 18 Jan 2008, 13:45
View user's profile Send private message Reply with quote
asmhack



Joined: 01 Feb 2008
Posts: 431
asmhack
i use the below code to get null-terminated string length
small and fast (on 'short' strings)..

Code:
  @@strlen:
           mov   ecx,[esp+$4]
           or    eax,-$1
  @@:
           lea   eax,[eax+$1]
           cmp   byte[ecx+eax],$0
           jnz   @b
           ret
    
Post 05 Feb 2008, 23:39
View user's profile Send private message Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP
Yeah, that's pretty much what I use, C compilers have almost same output.
Post 06 Feb 2008, 00:10
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
asmhack:
lea eax,[eax+1] ??????

inc eax is better...
Post 06 Feb 2008, 01:25
View user's profile Send private message Visit poster's website Reply with quote
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP
edfed wrote:
asmhack:
lea eax,[eax+1] ??????

inc eax is better...

lol
Post 06 Feb 2008, 01:26
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
inc is executable in all pipelines each cycle, for a pentium I (2 pipes) we can execute 2 inc in one clock.
i don't know lea timings, but it's sure to be slower.
3 --8
^
l
this is not a formula, it's a smiley. :X
Post 06 Feb 2008, 01:39
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

but it's sure to be slower.

Oh yeah? Why?

Perhaps that code is targetted for the NetBurst micro-architecture claimed by Intel to be superior but later retracted the idea and implemented a micro-architecture more near to PPro micro-architecture (Core series).

Anyway, inc in P4 is worst because it doesn't modify the CF so it generates a dependency that stalls execution so you must use add or lea since it doesn't touch EFLAGS at all so no CF dependency is generated.
Post 06 Feb 2008, 02:02
View user's profile Send private message Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
ok
Post 06 Feb 2008, 13:31
View user's profile Send private message Visit poster's website Reply with quote
dap



Joined: 01 Dec 2007
Posts: 61
Location: Belgium
dap
LocoDelAssembly wrote:
Anyway, inc in P4 is worst because it doesn't modify the CF so it generates a dependency that stalls execution so you must use add or lea since it doesn't touch EFLAGS at all so no CF dependency is generated.


From Agner Fog's documentation :
Quote:
The INC and DEC instructions do not modify the carry flag but they do modify the other arithmetic flags. Writing to only part of the flags register costs an extra uop on P4 and P4E.
It can cause a partial flags stalls on other Intel processors if a subsequent instruction reads
al the flag bits. Furthermore, it can cause a false dependence on the carry flag from a previous instruction.

Use ADD and SUB when optimizing for speed. Use INC and DEC when optimizing for size or when no penalty is expected.

_________________
(French only) http://dap.developpez.com
Post 06 Feb 2008, 15:47
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

add or lea since it doesn't touch EFLAGS at all so no CF dependency is generated.


Of course add touches EFLAGS (including CF so it does not stalls), I was refering only to lea.

And to clarify the inc vs lea, this codes takes the same time on my computer to complete:
Code:
mov ebx, 8
xor ecx, ecx
align 16
.loop: 
  inc eax 
  dec ecx 
  jnz .loop

  dec ebx
  jnz .loop

; 34781 ms
    


Code:
mov ebx, 8
xor ecx, ecx
align 16
.loop: 
  lea eax, [eax+1]
  dec ecx 
  jnz .loop

  dec ebx
  jnz .loop 

; 34782 ms
    

Ignore the 1 ms difference, is just a timer precision problem. The test was on an Athlon64 2.0 GHz (Venice core).
Post 06 Feb 2008, 15:56
View user's profile Send private message Reply with quote
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
rugxulo
INTEL P2 ARCHITECTURE
OPTIMIZATION MANUAL wrote:

Use one-byte instructions as much as possible. This will reduce code size and help increase
instruction density in the instruction cache. The most common example is using inc and
dec rather than adding or subtracting the constant 1 with add or sub. Another common
example is using the push and pop instructions instead of the equivalent sequence.


Also, it says LEA takes 1 clock and is pairable in either U or V pipeline. But I think an AGI can stall it (at least on a P1).
Post 08 Feb 2008, 01:53
View user's profile Send private message Visit poster's website Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
Use one-byte instructions as much as possible
Post 08 Feb 2008, 02:18
View user's profile Send private message Visit poster's website Reply with quote
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan
rugxulo wrote:

Use one-byte instructions as much as possible.

Also, it says LEA takes 1 clock and is pairable in either U or V pipeline.

C'mon and stop using this outdated optimization manuals. There is no U and V pipes any more, mani execution units instead, the architecture much more complicated - mu-ops, h/w register renaming, h/w prefetch, OoOE (out-of-order execution) thus any partial modifications either flags or partial registers lead to speed to speed fall.
Use <add REG, 1> or <sud reg, 1> for P4 and later for compatibility with P4 Smile

_________________
Any offers?
Post 08 Feb 2008, 06:36
View user's profile Send private message Reply with quote
edfed



Joined: 20 Feb 2006
Posts: 4237
Location: 2018
edfed
i've got only a PIII
soooo, i'll usse the inc dec form.
Post 08 Feb 2008, 12:52
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17249
Location: In your JS exploiting you and your system
revolution
You can always use a macro, which you selectively include just for P4 optimisation. See here where I posted such a macro 3 years ago.

BTW: This is only needed for the P4, not for earlier or later CPU's.
Post 08 Feb 2008, 13:06
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2, 3  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar.

Powered by rwasa.