flat assembler
Message board for the users of flat assembler.

Index > Main > String length

Goto page Previous  1, 2, 3, 4, 5, 6  Next
Author
Thread Post new topic Reply to topic
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 05 Nov 2013, 00:28
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 19:14; edited 1 time in total
Post 05 Nov 2013, 00:28
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20448
Location: In your JS exploiting you and your system
revolution 05 Nov 2013, 01:02
HaHaAnonymous wrote:
Humans have an advanced "function" called presentiment. It is not always accurate, but can be improved with training.l
How does this relate to making timing measurements?
Post 05 Nov 2013, 01:02
View user's profile Send private message Visit poster's website Reply with quote
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 05 Nov 2013, 01:32
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 19:13; edited 1 time in total
Post 05 Nov 2013, 01:32
View user's profile Send private message Reply with quote
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
ejamesr 05 Nov 2013, 18:22
revolution asked for timing measurements; the differences are slight. Is it worth it? Probably not. But... I've found these inner-loop differences when using negative counters to be much greater in smaller loops, where the value of eliminating a single line of code affects a greater percentage of the time.

Here's what I measured when testing 100*1000*1000 reps five times:
Code:
This was based on my Core2 Duo 2.66Ghz HP HDX-16t laptop.
Timings based on 100 million reps, five sets,
compared performance with different numbers of white-space chars
in front of the token; avg CPU clocks displayed:

Proc            0 w/s chars 4 w/s chars 8 w/s chars 12 w/s chars
----            ----------- ----------- ----------- ------------
FindToken       431af95e    842e51f2    bf0e6ce8    1:05d1ecf8
FindToken_rev   4325fefa    72642f88    bc81d24c      ff5d254a    
My test code is a bit raw; you have to modify the constants at the top of the file and run it under a debugger to inspect edx:eax at the breakpoint (which is the average # clocks over five runs). I used Ollydbg 2.0, recording the value of edx:eax at the break point.

When you look at my code, I reversed the return values used to denote 'no token' and 'too big' in order to take advantage of what was happening at the register level. Here's my test code:
Code:
format PE CONSOLE 4.0
stack   4000h, 4000h
entry Start

; Constants to modify
NUMREPS = 1000*1000*100
NUMSETS = 5
NUMWHITESPACE = 12              ; test values 0, 4, 8, 12
MAXTOKENLEN   = 20

TESTPROC equ FindToken
TESTPROC equ FindToken_rev      ; comment out to test 'FindToken'

include 'Win32A.Inc'
include 'Macro\If.Inc'

section '.data' data readable writeable
        Divisor dd   NUMSETS
        TStart dq       ?
        TCum   dq       0
tempToken:
        times NUMWHITESPACE db ' '
        db 'Token', 0

align 16
proc    FindToken Offset,MaxLen         ; original by Sasha

        mov     eax,[Offset]
        mov     ecx,[MaxLen]
        add     ecx,eax
        xor     edx,edx
        dec     eax
  .loop:
        inc     eax
        cmp     eax,ecx
        ja      .finish_too_big
        mov     dl,byte[eax]
        cmp     edx,020h
        je      .loop
        cmp     edx,09h
        je      .loop
        or      edx,edx
        je      .finish_no_token

  .finish:      ;Returns an offset in eax
        ret

  .finish_too_big:
        or      eax,-1
        ret

  .finish_no_token:
        xor     eax,eax
        ret    
endp      

align 16
proc    FindToken_rev Offset,MaxLen ; revised by Eric
        mov     ecx,[Offset]
        mov     eax,[MaxLen]     ; use eax as counter
        add     ecx, eax         ; point to end of data
        neg     eax              ; eax is now a negative counter
        dec     eax

  .loop:
        inc     eax              ; point to next byte AND test for MaxLen
        jz      .finish_too_big  ; when eax becomes positive, the end was reached
        mov     dl,byte[eax+ecx]
        cmp     dl,020h          ; using 'edx' instead of 'dl' incurs 'partial-register-stall' penalty
        je      .loop
        cmp     dl,09h
        je      .loop
        test    dl, dl           ; in some cases this is faster than 'or dl,dl'
        jz      .finish_no_token

  .finish:      ;Returns an offset in eax
        lea     eax, [eax+ecx]   ; quickly compute the offset
        ret

  .finish_too_big:               ; at this point, eax is 0... if you use 0 to
                                 ;  indicate "too big" you can save another instruction here
        ret

  .finish_no_token:
        or      eax,-1           ; use -1 to indicate no token
        ret    
endp


Start:
        xor     eax, eax
        mov     dword [TCum], eax
        mov     dword [TCum+4], eax

        rdtsc
        mov     dword [TStart], eax
        mov     dword [TStart+4], edx
        mov     edi, NUMSETS
.newSet:
        mov     esi, NUMREPS
.newRep:        
        stdcall TESTPROC, tempToken, MAXTOKENLEN
        dec     esi             ; do next rep
        jnz     .newRep
        ; All reps done, record time
        rdtsc
        push    eax
        push    edx
        ; Get elapsed time
        sub     eax, dword [TStart]
        sbb     edx, dword [TStart+4]
        ; Accumulate time
        add     dword [TCum], eax
        adc     dword [TCum+4], edx
        ; reset start time...
        pop     edx
        pop     eax
        mov     dword [TStart], eax
        mov     dword [TStart+4], edx
        ; See if another set
        dec     edi
        jnz     .newSet
        ; Finished, calculate average time
        mov     eax, dword [TCum+4]
        xor     edx, edx
        div     [Divisor]
        push    eax                     ; save quotient, combine after next step
        mov     eax, dword [TCum]
        div     [Divisor]
        add     edx, [esp]
        add     esp, 4                  ; clean up stack
int3
        invoke  ExitProcess, 0

section '.idata' import data readable writeable
        library kernel32,'KERNEL32.DLL',user32,'USER32.DLL',gdi32,'GDI32.DLL'
        include 'API\Kernel32.Inc'
        include 'API\User32.Inc'
        include 'API\Gdi32.Inc'

section '.edata' export data readable
        export 'Program.exe',\
Start,'Start',\
Start.newSet,'Start_newSet',\
Start.newRep,'Start_newRep',\
FindToken,'FindToken',\
FindToken_rev,'FindToken_rev',\
Divisor,'Divisor',\
TStart,'TStart',\
TCum,'TCum'

section '.reloc' fixups data readable discardable    
I didn't check to see what happens when the inner loop is aligned. Also, eliminating the stack frame would can make a difference. And register-based parameters can speed it up, too. And there's more beyond those that can still be done... I'll leave all this for others to test. Smile

Gotta get back to work now.

Eric
Post 05 Nov 2013, 18:22
View user's profile Send private message Send e-mail Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 05 Nov 2013, 18:55
I've made all the changes.
Code:
proc    StrzLen4 Strz,MaxLen

        mov     eax,[MaxLen]
        mov     ecx,[Strz]
        add     ecx,eax
        neg     eax
        dec     eax
        jmp     @f
  .loop:
        cmp     byte[ecx+eax],0
        jz      .finish_ok
  @@:
        inc     eax
        js      .loop

  .too_big:
        or      eax,-1
        ret
  .finish_ok:
        add     eax,[MaxLen];Fixed. Using eax as a counter now.
        ret
endp                           
    

I thought about benchmarking, too. But I need another proc to compare with. To get a relative values. Also need some real strings to parse, otherwise it is a spherical cow)


Last edited by Sasha on 05 Nov 2013, 20:15; edited 2 times in total
Post 05 Nov 2013, 18:55
View user's profile Send private message Reply with quote
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
ejamesr 05 Nov 2013, 19:40
Looks good, but the 'lea' line should probably be something like 'add ecx, [MaxLen]'.

If speed is important to you, benchmarking can be your best friend. But... benchmarking itself can produce results that are inconsistent and that depend on what's happening in the data and instruction caches due to what other processes are doing that can 'contaminate' the cache that your program has created due to normal program execution.

I see your code jumps around .loop. That makes available another little improvement that can sometimes speed things up. Add 'align 16' just in front of .loop. Since there's a jump around it, the 'nop' instructions never get executed. Or, a more sophisticated way would be to align the start of the procedure so that .loop is exactly aligned on a 16-byte boundary. This helps ensure that all instructions of the small loop fit on as few cache lines as possible, running as fast as they can.

Back to work for me now...
Post 05 Nov 2013, 19:40
View user's profile Send private message Send e-mail Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 05 Nov 2013, 22:27
Benchmark showed that this code is very little, but faster than the previous.
Code:
proc    StrzLen5 Strz,MaxLen

        mov     eax,[MaxLen]
        mov     ecx,[Strz]
        add     ecx,eax
        neg     eax
        dec     eax
  .loop:
        inc     eax
        jns     .too_big
        cmp     byte[ecx+eax],0
        jnz     .loop
  .finish_ok:
        add     eax,[MaxLen]
        ret

  .too_big:
        or      eax,-1
        ret
endp                             
    

Just removed the jmp. So even such little differences can be seen.
Post 05 Nov 2013, 22:27
View user's profile Send private message Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 05 Nov 2013, 22:57
This is my benchmaker)


Description:
Download
Filename: benchstrlen.ASM
Filesize: 3.45 KB
Downloaded: 543 Time(s)

Post 05 Nov 2013, 22:57
View user's profile Send private message Reply with quote
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
ejamesr 05 Nov 2013, 23:18
Sasha,

I'm glad you are willing to test these things... it makes you better and your code faster.

Here's another thing to test. On some computers, a command like:
Code:
cmp     byte[ecx+eax],0    
can be faster if implemented like this:
Code:
mov     dl, [ecx+eax]
test    dl, dl     ; or 'cmp   reg, immValue'     

I always use 'test' rather than 'or' because I suspect it's slightly faster, but I've never tested it. HOWEVER, everything changes; what is true today may change tomorrow as technology advances. It's good to retest things periodically to see if new CPU architectures have changed things.

As you've seen, sometimes very small differences can be measured, and they can add up to a noticeable difference. And there is almost ALWAYS a way to still make things faster.

Happy programming!

Eric
Post 05 Nov 2013, 23:18
View user's profile Send private message Send e-mail Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 06 Nov 2013, 00:14
"test dl,dl" and "test edx,edx" is almost x3 faster than "or dl,dl" and "or edx,edx" !!!
I used this approach
Code:
;Put your code here. Do not use ebx
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
        test    edx,edx
    

I think, the explanation is that in this case, "test" doesn't wait the previous "test" to finish, and the instructions can be paired. While "or" does wait.
But such a code will not be met in the real world, so yes, only the real code must be benchmarked.
Post 06 Nov 2013, 00:14
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20448
Location: In your JS exploiting you and your system
revolution 06 Nov 2013, 13:39
It is good to see some proper testing going on here. So now you will know if you are making any significant changes or not.

But I do encourage you to do a proper utility analysis. ejamesr asked above "Is it worth it?" And the question is very important and should not be ignored lightly. Is it worth it to spend two weeks optimising your code for speed (tested for speed on one CPU only also) and manage to save 1μs per execution, and then find that for the lifetime of the program (say five years) of 1 billion executions you can only save 1000 seconds total (less than one hour)?
Post 06 Nov 2013, 13:39
View user's profile Send private message Visit poster's website Reply with quote
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 06 Nov 2013, 15:43
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 19:13; edited 2 times in total
Post 06 Nov 2013, 15:43
View user's profile Send private message Reply with quote
ejamesr



Joined: 04 Feb 2011
Posts: 52
Location: Provo, Utah, USA
ejamesr 06 Nov 2013, 18:13
Good points from revolution.

Optimization may make little sense if the majority of your program's execution is in a wait loop waiting for user input (can you even build a "faster" wait loop??). And it may make little sense to try to speed up a core inner loop if the data being accessed is not in the L2 cache, where you may end up waiting hundreds of nanoseconds for the data to be accessible (try accessing elements of a 500MB array randomly, compared to accessing a similar 500k array in the same way -- I've seen differences from about 10x to 100x slower, and even worse, with the larger array!).

But... having a sense of a "need for speed" can help motivate you to think up better algorithms that will provide the serious performance leaps you want. It doesn't hurt to improve your understanding of what actually happens inside the different flavors of different CPUs. It's good to continue learning faster/better ways to do things.

Making it work properly is most important... but still not good enough, for me. It's probably a genetic defect that causes me to keep wanting to make things faster and faster. The reality, however, is you have to find the proper balance.

For example, I put off learning how to use the FASM 'match' directive until a couple weeks ago. Now, with a far better (but still imperfect) understanding, I've created several helper macros that are already saving me time and maintenance headaches when programming. (And I'll either be able to get stuff out the door faster, OR have more time to optimize the critical areas.)

Getting projects "out the door quickly" is sometimes more important than having them execute another 4.67% faster; sometimes "getting it done NOW" is where the real need for speed is. But it's always fun to try shaving off a little bit more execution time; and sometimes that's when the real blockbuster innovations come.
Post 06 Nov 2013, 18:13
View user's profile Send private message Send e-mail Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 18 Jun 2014, 18:26
JohnFound wrote:
If you want to optimize for speed you definitely should not compare byte by byte. It is the slowest possible solution. Below is the version of StrLen procedure from FreshLib. As you can see, it reads and compares the string by double words.

But what is more important, using ASCIIZ is not the best option both for speed and security. As you can see, when the input string is native FreshLib string (With handle >= $c0000000) it uses format where on the address-4 the actual length is stored (so called Pascal strings). In this case, the speed of the StrLen is O(1), which is the fastest possible at all.

Code:
proc StrLen, .hString  
begin
        mov     eax, [.hString]
        cmp     eax, $c0000000
        jb      .pointer

        stdcall StrPtr, eax
        jc      .error

        mov     eax, [eax+string.len]
        clc
        return

.error:
        xor     eax, eax
        stc
        return

.pointer:
        push    ecx edx esi edi

; align on dword
.byte1:
        test    eax, 3
        jz      .scan

        cmp     byte [eax], 0
        je      .found

        inc     eax
        jmp     .byte1

.scan:
        mov     ecx, [eax]
        mov     edx, [eax+4]

        lea     eax, [eax+8]

        lea     esi, [ecx-$01010101]
        lea     edi, [edx-$01010101]

        not     ecx
        not     edx

        and     esi, ecx
        and     edi, edx

        and     esi, $80808080
        and     edi, $80808080

        or      esi, edi
        jz      .scan

        sub     eax, 9

; byte 0 was found: so search by bytes.
.byteloop:
        lea     eax, [eax+1]
        cmp     byte [eax], 0
        jne     .byteloop

.found:
        sub     eax, [.hString]
        clc
        pop     edi esi edx ecx
        return
endp    


Decided to come back to this question, and finally this is what I've got.
Code:
align 32
proc    strlen uses ebx esi edi,str

        mov     eax,[str]
        mov     ebx,-01010101h

  .aligning:
        test    eax,3
        jz      .scan

        mov     dl,[eax]
        test    dl,dl
        jz      .found

        inc     eax

        jmp     .aligning

align 32
  .scan:
        mov     esi,[eax]
        mov     edi,[eax+4]

        lea     eax,[eax+8]

        lea     ecx,[esi+ebx]   ;!
        lea     edx,[edi+ebx]

        not     esi
        not     edi

        and     ecx,esi
        and     edx,edi

        and     ecx,$80808080
        and     edx,$80808080

        test    ecx,ecx         ;!!
        jnz     .sub8
        test    edx,edx
        jz      .scan

; byte 0 was found: so search by bytes.
        lea     eax,[eax-4]
        mov     ecx,edx
        jmp     .bytesearch

  .sub8:
        lea     eax,[eax-8]

  .bytesearch:                  ;!!!
        test    cl,cl
        jnz     .found
        inc     eax
        test    ch,ch
        jnz     .found
        shr     ecx,16
        inc     eax
        test    cl,cl
        jnz     .found
        inc     eax

  .found:
        sub     eax,[str]
        ret
endp                                                                
    
Post 18 Jun 2014, 18:26
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 19 Jun 2014, 06:34
Sasha, did you made some benchmarks on this code?
Post 19 Jun 2014, 06:34
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 19 Jun 2014, 11:22
This topic has been discussed in the past if you use the forum's search. http://board.flatassembler.net/topic.php?t=7710
Post 19 Jun 2014, 11:22
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 19 Jun 2014, 11:27
Yes, I've made after every change, to know if I'm in the right dirrection. But, this is only for my computer.


Description:
Filesize: 19.91 KB
Viewed: 13777 Time(s)

strlen_bench.png


Post 19 Jun 2014, 11:27
View user's profile Send private message Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 19 Jun 2014, 11:42
r22 wrote:
This topic has been discussed in the past if you use the forum's search. http://board.flatassembler.net/topic.php?t=7710

I saw it, but then I focused on optimizing fresh lib's routine. I'll review that thread for another ideas.
Post 19 Jun 2014, 11:42
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 19 Jun 2014, 11:52
Sasha wrote:
Yes, I've made after every change, to know if I'm in the right dirrection. But, this is only for my computer.


Hm, the result is interesting. The performance gain is small, but the optimized code graphic is much smoother than the original. I am curious what change caused this smoothness?

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9
Post 19 Jun 2014, 11:52
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 19 Jun 2014, 12:12
The smoothness is what I really wanted to gain. The worst thing is this "sub eax,9" and then scan again.
I tried to extract the information about where is zero byte without scaning again. A saw, that all that 8 bytes are already in registers.
Post 19 Jun 2014, 12:12
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5, 6  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.