flat assembler
Message board for the users of flat assembler.
fast strlen
lovefasm 12 Oct 2007, 15:09
; Note: an assembly implementation of an efficient StrLen function.
; Tested with Visual C++ 2005: much faster than the strlen in its bundled C library; the time ratio was 8/13.
; Argument validity is not checked; guarding the call with SEH __try/__except(XXX) is recommended.
; Assembler: fasm 1.67.23
; Completed: 2007.10.12, evening

format ms coff

public StrLen as '_StrLen@4'

; int __stdcall StrLen(const char* str)   ; C prototype
StrLen:
.string equ dword [esp+4]
        mov     ecx,.string     ; ecx = .string
        test    ecx,3           ; is the start of the string 4-byte aligned? (what Visual C++ generates by default is in fact always 4-byte aligned)
        je      .main_loop
.str_misaligned:                ; check the 1 to 3 bytes before the alignment boundary
        mov     al,[ecx]
        inc     ecx
        test    al,al
        je      .byte_3
        test    ecx,3
        jne     .str_misaligned
; aligned
.main_loop:
        mov     eax,[ecx]       ; read 4 bytes
        add     ecx,4
        test    al,al           ; is it byte 0
        je      .byte_0         ; overcounted by 4 bytes
        test    ah,ah           ; is it byte 1
        je      .byte_1         ; overcounted by 3 bytes
        shr     eax,16
        test    al,al           ; is it byte 2
        je      .byte_2         ; overcounted by 2 bytes
        test    ah,ah           ; is it byte 3
        je      .byte_3         ; overcounted by 1 byte
        jmp     .main_loop      ; go read the next 4 bytes
; subtract the overcounted bytes
.byte_3:
        dec     ecx
        jmp     .result
.byte_2:
        sub     ecx,2
        jmp     .result
.byte_1:
        sub     ecx,3
        jmp     .result
.byte_0:
        sub     ecx,4
.result:
        mov     eax,ecx
        sub     eax,.string
        ret     4
vid 12 Oct 2007, 15:23
Not as fast as it could be, though... there are ways to test an entire dword with a single comparison, after doing some bit magic on the dword.
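One widely known form of that bit magic, sketched in C rather than fasm so it is easy to check. This is only an illustration under stated assumptions, not necessarily the variant vid had in mind: the names has_zero_byte and strlen_dword are made up here, and, like the routine in the first post, the dword loop may read up to 3 bytes past the terminator. Those reads stay inside one aligned dword, which works on x86 in practice but is not guaranteed by ISO C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in v is 0x00.
   Subtracting 1 from each byte borrows into bit 7 only for bytes that
   were 0x00 or had bit 7 set; "& ~v" discards the latter case. */
static int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* Hypothetical helper name, not taken from the thread. */
static size_t strlen_dword(const char *s)
{
    const char *p = s;

    /* Step byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* One combined test per aligned dword instead of four byte tests.
       Assumption: reading the whole aligned dword that holds the
       terminator is safe on the target platform. */
    for (;;) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        if (has_zero_byte(v))
            break;
        p += 4;
    }

    /* Locate the terminator inside the final dword. */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```

The dword test replaces the four test/je pairs of the main loop; the per-byte scan runs only once, on the last dword.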
Mac2004 12 Oct 2007, 18:03
lovefasm: Could you please use code tags? They improve the readability of your code quite a bit.
regards, Mac2004
SomeoneNew 29 Dec 2007, 14:45
Either that, or store the string length and update it after every length-changing operation.
_________________
Im new, sorry if I bothered with any stupid question
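A minimal C sketch of that idea, assuming a fixed-capacity buffer; the type counted_str, the helpers cs_from and cs_append, and the capacity CS_CAPACITY are all hypothetical names invented for this illustration. The length is computed once when the string is created and adjusted on each append, so reading it never rescans the bytes.

```c
#include <stddef.h>
#include <string.h>

#define CS_CAPACITY 64  /* hypothetical fixed capacity, including the terminator */

typedef struct {
    size_t len;                /* cached length, excluding the terminator */
    char   data[CS_CAPACITY];  /* always NUL-terminated */
} counted_str;

/* Build a counted string; this is the only place the source is scanned. */
static counted_str cs_from(const char *text)
{
    counted_str s = {0};
    size_t n = strlen(text);
    if (n > CS_CAPACITY - 1)
        n = CS_CAPACITY - 1;   /* truncate to fit */
    memcpy(s.data, text, n);
    s.data[n] = '\0';
    s.len = n;
    return s;
}

/* Append text and update the cached length in the same step. */
static counted_str cs_append(counted_str s, const char *text)
{
    size_t room = CS_CAPACITY - 1 - s.len;
    size_t n = strlen(text);
    if (n > room)
        n = room;              /* truncate to fit */
    memcpy(s.data + s.len, text, n);
    s.len += n;
    s.data[s.len] = '\0';
    return s;
}
```

With this layout the length of an existing string is just the len field; the trade-off is that every operation that changes the length has to keep it in sync.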
Vasilev Vjacheslav 18 Jan 2008, 13:45
What about Agner Fog's strlen?
asmhack 05 Feb 2008, 23:39
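I use the code below to get the length of a null-terminated string.
Small and fast (on 'short' strings)..
Code:
@@strlen:
        mov     ecx,[esp+$4]            ; ecx = pointer to the string
        or      eax,-$1                 ; eax = -1, so the first lea makes it 0
@@:
        lea     eax,[eax+$1]            ; eax = eax + 1, flags untouched
        cmp     byte[ecx+eax],$0        ; reached the terminator?
        jnz     @b
        ret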
AlexP 06 Feb 2008, 00:10
Yeah, that's pretty much what I use; C compilers produce almost the same output.
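For comparison, a minimal C sketch of the same byte-at-a-time loop. The name strlen_bytewise is made up for this illustration, and the code a compiler actually emits for it depends on the compiler and its optimization flags.

```c
#include <stddef.h>

/* Hypothetical name; walks the string one byte at a time. */
static size_t strlen_bytewise(const char *s)
{
    size_t i = 0;
    while (s[i] != '\0')   /* same test as cmp byte[ecx+eax],0 */
        ++i;
    return i;
}
```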
edfed 06 Feb 2008, 01:25
asmhack:
lea eax,[eax+1] ?????? inc eax is better...
AlexP 06 Feb 2008, 01:26
edfed wrote: asmhack: lol
edfed 06 Feb 2008, 01:39
inc can execute in either pipeline every cycle; on a Pentium I (2 pipes) we can execute 2 inc in one clock.
I don't know the lea timings, but it's sure to be slower.
3 --8 ^ l this is not a formula, it's a smiley. :X
LocoDelAssembly 06 Feb 2008, 02:02
Quote:
Oh yeah? Why? Perhaps that code is targeted at the NetBurst micro-architecture, which Intel claimed to be superior before retracting the idea and implementing a micro-architecture closer to the PPro micro-architecture (Core series). Anyway, inc on the P4 is worse because it doesn't modify CF, so it generates a dependency that stalls execution, so you must use add or lea since it doesn't touch EFLAGS at all, so no CF dependency is generated.
edfed 06 Feb 2008, 13:31
ok
dap 06 Feb 2008, 15:47
LocoDelAssembly wrote:
Anyway, inc on the P4 is worse because it doesn't modify CF, so it generates a dependency that stalls execution, so you must use add or lea since it doesn't touch EFLAGS at all, so no CF dependency is generated.
From Agner Fog's documentation:
Quote:
The INC and DEC instructions do not modify the carry flag but they do modify the other arithmetic flags. Writing to only part of the flags register costs an extra uop on P4 and P4E.
LocoDelAssembly 06 Feb 2008, 15:56
Quote:
Of course add touches EFLAGS (including CF, so it does not stall); I was referring only to lea. And to clarify inc vs lea, these two snippets take the same time to complete on my computer:
Code:
        mov     ebx, 8
        xor     ecx, ecx
        align 16
.loop:
        inc     eax
        dec     ecx
        jnz     .loop
        dec     ebx
        jnz     .loop
; 34781 ms
Code:
        mov     ebx, 8
        xor     ecx, ecx
        align 16
.loop:
        lea     eax, [eax+1]
        dec     ecx
        jnz     .loop
        dec     ebx
        jnz     .loop
; 34782 ms
Ignore the 1 ms difference; it is just a timer precision problem. The test was run on an Athlon64 2.0 GHz (Venice core).
rugxulo 08 Feb 2008, 01:53
INTEL P2 ARCHITECTURE OPTIMIZATION MANUAL wrote:
Also, it says LEA takes 1 clock and is pairable in either the U or the V pipeline. But I think an AGI (address generation interlock) can stall it (at least on a P1).
edfed 08 Feb 2008, 02:18
Use one-byte instructions as much as possible
asmfan 08 Feb 2008, 06:36
rugxulo wrote:
C'mon, stop using these outdated optimization manuals. There are no U and V pipes any more; there are many execution units instead, and the architecture is much more complicated: micro-ops, h/w register renaming, h/w prefetch, OoOE (out-of-order execution). Thus any partial modification, of either flags or partial registers, leads to a drop in speed. Use <add REG, 1> or <sub REG, 1> on P4 and later, for compatibility with the P4.
_________________
Any offers?
edfed 08 Feb 2008, 12:52
I've only got a PIII,
soooo, I'll use the inc/dec form.
revolution 08 Feb 2008, 13:06
You can always use a macro, which you selectively include just for P4 optimisation. See here where I posted such a macro 3 years ago.
BTW: This is only needed for the P4, not for earlier or later CPUs.
Copyright © 1999-2025, Tomasz Grysztar.