flat assembler
Message board for the users of flat assembler.

Index > Main > String length

HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 23 Jun 2014, 00:30
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 18:08; edited 1 time in total
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 23 Jun 2014, 00:38
If you want to reduce the byte count you can use "or eax,-1" instead of "mov eax,-1".
AsmGuru62



Joined: 28 Jan 2004
Posts: 1671
Location: Toronto, Canada
AsmGuru62 23 Jun 2014, 02:08
Sasha:
Your graphs look nice.
I just wanted to know how it compares with the other variants.
Smile
r22



Joined: 27 Dec 2004
Posts: 805
r22 23 Jun 2014, 14:10
Here's my crack at strlen.
* The alignment loop runs at most 3 iterations, so unroll it.
* No need to 'bytefind' once we know the null is in there; just use BSF.
* I use multiple RET paths rather than trying to mangle everything into one (the trade-off of call/ret pairing vs. better opcode throughput seems to be worth it).
Code:
strlen_r22:
        push    ebx
        push    esi
        push    edi
        mov     eax, [esp + 16]
        mov     ebx, -01010101h
        test    eax, 3
        jz      .scan
        mov     edx, [eax]
        test    dl, dl
        jz      .found
        inc     eax
        test    eax, 3
        jz      .scan
        test    dh, dh
        jz      .found
        inc     eax
        shr     edx, 16
        test    eax, 3
        jz      .scan
        test    dl, dl
        jz      .found
        inc     eax
        jmp     .scan
align 16
.scan:
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jz      .scan
.foundhi:
        bsf     edx, edx
        sub     eax, [esp + 16]
        shr     edx, 3
        lea     eax, [eax + edx - 4]
        pop     edi
        pop     esi
        pop     ebx
        ret     4
.foundlo:
        bsf     ecx, ecx
        sub     eax, [esp + 16]
        shr     ecx, 3
        lea     eax, [eax + ecx - 8]
        pop     edi
        pop     esi
        pop     ebx
        ret     4       
.found:
        sub     eax, [esp + 16]
        pop     edi
        pop     esi
        pop     ebx
        ret     4
    


Since ideone.com has a NASM compiler, I created a kludge benchmark there. http://ideone.com/S6JKdX
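The zero-byte test in the .scan loop above is the classic SWAR trick: with ebx holding -01010101h, the two LEAs compute v - 01010101h, and (v - 01010101h) & ~v & 80808080h is non-zero exactly when some byte of v is zero. A C sketch of the same dword-at-a-time scan, just to illustrate the technique (the function names are mine, not r22's; __builtin_ctz plays the role of BSF and this assumes little-endian x86, like the assembly above):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Non-zero iff some byte of v is zero: the test used in the .scan loop. */
static int has_zero_byte(uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* Dword-at-a-time strlen sketch. The 4-byte loads may read up to 3 bytes
   past the terminator, so the string must live in a buffer padded to a
   dword boundary (the assembly versions rely on the same property). */
static size_t strlen_swar(const char *s) {
    const char *p = s;
    for (;;) {
        uint32_t v;
        memcpy(&v, p, 4);               /* compiles to a plain 32-bit mov */
        uint32_t hit = (v - 0x01010101u) & ~v & 0x80808080u;
        if (hit)
            /* ctz = BSF: bit 8*k+7 flags byte k, so >>3 gives the byte index */
            return (size_t)(p - s) + ((unsigned)__builtin_ctz(hit) >> 3);
        p += 4;
    }
}
```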


Last edited by r22 on 23 Jun 2014, 17:38; edited 1 time in total
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1178
Location: Unknown
HaHaAnonymous 23 Jun 2014, 17:09
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 18:08; edited 1 time in total
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 24 Jun 2014, 07:16
Why is there still no comparison with rep cmpsb? Let's at least have a baseline for comparison, to make sure there is a positive improvement and to see by how much. Is it 10% slower? 10% faster? 100% faster? Is it better only on short strings, or on long strings?

I remember reading that Intel put in special support for the string instructions. Perhaps AMD has too. Is it better only on AMD CPUs? Perhaps the latest CPUs give different results than older ones?

Edit: That should be rep scasb. Embarassed


Last edited by revolution on 24 Jun 2014, 09:12; edited 1 time in total
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 24 Jun 2014, 09:09
Why cmpsb? Isn't it scasb?
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 24 Jun 2014, 09:11
JohnFound wrote:
Why cmpsb? Isn't it scasb?
Yes it is. My mistake. Thanks for the correction.
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 24 Jun 2014, 10:12
revolution wrote:
Yes it is. My mistake. Thanks for the correction.


I made some tests on scasb and non-scas byte scanning:

Code:
proc StrLenByte, .ptrString
begin
        mov     eax, [.ptrString]
        dec     eax

.loop:
        inc     eax
        cmp     byte [eax], 0
        jne     .loop

        sub     eax, [.ptrString]
        return
endp


proc StrLenByteS, .ptrString
begin
        push    edi ecx

        mov     edi, [.ptrString]
        or      ecx, -1
        xor     eax, eax

        repne scasb
        lea     eax, [edi-1]
        sub     eax, [.ptrString]

        pop     ecx edi
        return
endp    
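For comparison outside FASM, the two procedures can be mirrored in C like this (my sketch, not JohnFound's code; the scasb variant uses GCC-style inline assembly and assumes an x86 target):

```c
#include <stddef.h>

/* Plain byte loop, equivalent to StrLenByte above. */
static size_t strlen_byte(const char *s) {
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

/* repne scasb, equivalent to StrLenByteS above (GCC/Clang extended asm). */
static size_t strlen_scasb(const char *s) {
    const char *p = s;
    size_t count = (size_t)-1;          /* or ecx, -1 */
    __asm__ volatile(
        "xor %%eax, %%eax\n\t"          /* AL = 0: the byte to scan for */
        "repne scasb"                   /* scan until the NUL is matched */
        : "+D"(p), "+c"(count)
        :
        : "eax", "cc", "memory");
    return (size_t)(p - 1 - s);         /* edi/rdi stops one past the NUL */
}
```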


The results are shown on the graph. The labels are:
1. Sasha - the Sasha algorithm.
2. Dword - the same algorithm, but comparing only dwords.
3. Byte - the above "StrLenByte" procedure.
4. ByteS - the above "StrLenByteS" procedure (scasb).

The X coordinate is the length of the string.
The Y coordinate is the time for one execution in [ns] (actually the time for 1000000 executions in [ms], which is the same number).

The CPU is an Intel Core i3 @ 3.4 GHz, Windows 7 64-bit.


[Attachment: byte.png, 24.25 KB, viewed 15060 times]



_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 24 Jun 2014, 10:17
JohnFound: It would help to know which CPU you are testing with.

Also are all strings already in cache before scanning?
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 24 Jun 2014, 10:28
revolution wrote:
JohnFound: It would help to know which CPU you are testing with.

Also are all strings already in cache before scanning?


The post is edited and the CPU type added.

The benchmark creates an array of 1000 strings of the given length, then calls every procedure 1000 times on every string in the array. In the presented run, all strings are aligned on 16 bytes.
For every length (from 16 to 1500) a new array of strings is created.

revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20454
Location: In your JS exploiting you and your system
revolution 24 Jun 2014, 10:31
Okay, so the strings are already in cache, and the branch predictor will have an easy time building history and exiting without a misprediction.

I wonder how different the times would be for uncached strings of [pseudo-]random length (which is probably a better model for a real-world application)?
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 24 Jun 2014, 10:35
Well, I tried creating an array of 1000000 unique strings (still all the same length) and the results were pretty much the same, actually. The size of the data (at length 1500) is around 1.5 GB, so the strings were guaranteed not to be in the cache.
r22



Joined: 27 Dec 2004
Posts: 805
r22 24 Jun 2014, 12:42
Unrolling the main loop (4x) in my variation (Sasha + Agner Fog) adds a further improvement.
Code:
strlen_unrolled:
        push    ebx
        push    esi
        push    edi
        mov     eax, [esp + 16]
        mov     ebx, -01010101h
        test    eax, 3
        jz      .scan
        mov     edx, [eax]
        test    dl, dl
        jz      .found
        inc     eax
        test    eax, 3
        jz      .scan
        test    dh, dh
        jz      .found
        inc     eax
        shr     edx, 16
        test    eax, 3
        jz      .scan
        test    dl, dl
        jz      .found
        inc     eax
        jmp     .scan
.found:
        sub     eax, [esp + 16]
        pop     edi
        pop     esi
        pop     ebx
        ret     4       
align 16
.scan:
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jnz     .foundhi
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jnz     .foundhi
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jnz     .foundhi
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jz      .scan           
.foundhi:
        bsf     edx, edx
        sub     eax, [esp + 16]
        shr     edx, 3
        lea     eax, [eax + edx - 4]
        pop     edi
        pop     esi
        pop     ebx
        ret     4
.foundlo:
        bsf     ecx, ecx
        sub     eax, [esp + 16]
        shr     ecx, 3
        lea     eax, [eax + ecx - 8]
        pop     edi
        pop     esi
        pop     ebx
        ret     4
    

http://ideone.com/XyVmKH

Depending on the average size you're testing for, and whether you can avoid reading past a memory boundary, SIMD would probably be faster. The same algorithm applies: use an unaligned load for the alignment prologue, then aligned loads (MOVDQA) in the main loop, PTEST for the comparison, and PMOVMSKB + BSF once a null is found.
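That SIMD approach looks roughly like this with SSE2 intrinsics (a sketch of mine, not r22's code: it uses PCMPEQB + PMOVMSKB for both the scan and the locate step instead of SSE4.1's PTEST, and handles alignment by rounding the pointer down so every 16-byte load is aligned and stays within the string's page):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

static size_t strlen_sse2(const char *s) {
    const __m128i zero = _mm_setzero_si128();

    /* Align down to 16: the first load may cover bytes before s,
       but it never crosses into another page. */
    const char *base = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    __m128i block = _mm_load_si128((const __m128i *)base);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));
    mask >>= (unsigned)(s - base);          /* drop bytes before the string */
    if (mask)
        return (size_t)__builtin_ctz(mask); /* ctz = BSF: first NUL */

    for (;;) {
        base += 16;
        block = _mm_load_si128((const __m128i *)base);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(block, zero));
        if (mask)
            return (size_t)(base - s) + (size_t)__builtin_ctz(mask);
    }
}
```

The align-down trick means the prologue needs no separate unaligned load at all; it just shifts the stale low bits out of the byte mask.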
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 24 Jun 2014, 15:30
JohnFound, how do you count the time? I used GetTickCount and a very large number of executions, so it was problematic to test the big ranges. Now I'm porting to rdtsc. It is very accurate, but that accuracy brings its own problems. Now, on the large range, I can see the performance gain from unrolling the loop, which I didn't see before; it begins after a length of 512 bytes on my computer. Also, I used the same buffer for all tests; now I'll try to create a separate buffer for every execution.
r22, the bsf is the real gem!
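A minimal rdtsc reading in C can be done with the compiler's __rdtsc intrinsic (a sketch only: serializing fences, timer-overhead subtraction, and core pinning are all omitted, which is why single-shot numbers like the ones below jitter so much):

```c
#include <x86intrin.h>   /* __rdtsc */
#include <stddef.h>
#include <string.h>

/* Cycles for one strlen call on one string. The very first measurement
   is typically much larger than the rest (cold caches and predictors),
   matching the first number in the output above. */
static unsigned long long time_one_call(const char *s, size_t *out_len) {
    unsigned long long t0 = __rdtsc();
    size_t len = strlen(s);
    unsigned long long t1 = __rdtsc();
    *out_len = len;
    return t1 - t0;
}
```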
r22



Joined: 27 Dec 2004
Posts: 805
r22 24 Jun 2014, 15:58
The BSF + SHR usage is from Agner Fog's optimized strlen routine.

If you check the ideone link from my last post, the "benchmark" and "makestring" procedures seem to work pretty well. I create 4 copies of the same string at different alignments, so that the alignment portion of the code is accounted for. But I think for a heavily optimized strlen you would already have aligned buffers, so maybe the pre-alignment code is out of scope.
AsmGuru62



Joined: 28 Jan 2004
Posts: 1671
Location: Toronto, Canada
AsmGuru62 24 Jun 2014, 16:53
QueryPerformanceCounter() is probably better than GetTickCount().
JohnFound



Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 24 Jun 2014, 16:56
@Sasha: I am counting the time using the FreshLib GetTimestamp procedure, which on Windows is GetTickCount and on Linux is sys_gettimeofday.
Anyway, as long as the test runs long enough (1000000 executions for every string length and function), the overall CPU load influences the measured values, making the test close to real life. Actually, I am not a fan of optimizing to the last extent, but some of the results are really strange to me.
Later I will publish some results from other CPUs, Intel Atom and AMD Sempron, which are even stranger. Smile
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 24 Jun 2014, 17:17
With rdtsc even a single execution can be measured. Now I've made a benchmarker that counts the rdtsc cycles of a single execution and, most importantly, takes a new string and a new instance of the code for every test. I called it a "cold benchmark" because of that. Look at the code. Some macros would need to be created, if such an approach proves worth it.
This is the output on my Core 2 Duo @ 2.2 GHz:
3927
781
154
550
473
473
451
143
176
176
176
165
528
165
176
352
220
187
220
176
187
198
198
231
242
209
231
253
231
242
165
264

The first number is always much bigger than the rest.


[Attachment: coldbenchmark.zip, 3.15 KB, downloaded 528 times]
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha 24 Jun 2014, 17:54
JohnFound wrote:
Actually I am not a fan of optimizing to the last extent.

Yes, and after all this, we still don't know which is better. In real life this function may be called very rarely, and if so, the size-optimized version can be better.
The only solution I see is to test the code on real workloads. For example, a compiler can be tested by compiling real code.


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.