flat assembler
Message board for the users of flat assembler.

Index > Main > String length

Goto page Previous  1, 2, 3, 4, 5, 6  Next
Author
Thread Post new topic Reply to topic
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
BTW, RDTSC is not the best way to measure code performance. At least it has no physical meaning. Notice, that the modern CPU can change its clock frequency during execution.

IMHO, using QueryPerformanceFrequency/QueryPerformanceCounter is better solution in Windows.
Post 24 Jun 2014, 18:12
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@sasha if you are trying to do fine grained timings with RDTSC you will need to do the following:

* Disable turbo boost (usual C-state and power settings in your BIOS)

* Try to force the process to the highest priority class

* Try to force the executing thread to be scheduled on a single cpu core

* Try to force the executing thread to the highest priority level

* Execute a warmup loop right before the timing. This tries to force your code into a single scheduled time slice and enable turbo boost/cpu frequency increases

On windows something like this...
Code:
_start:
        ;; process priority class
        invoke  GetCurrentProcess
        push    0x100 ;;REALTIME_PRIORITY_CLASS
        push    eax
        invoke  SetPriorityClass
        ;; thread priority
        invoke  GetCurrentThread
        push    15 ;;THREAD_PRIORITY_TIME_CRITICAL
        push    eax
        invoke  SetThreadPriority
        ;; thread cpu core affinity
        invoke  GetCurrentThread
        push    2 ;;use CPU Core 2
        push    eax
        invoke  SetThreadAffinityMask
...
        push    strbuf
        push    strlen_optimized
        call    TimeSingleCall
...

TimeSingleCall: ;; (func, arg)
        push    ebp
        push    ebx
        push    esi
        push    edi
        mov     ebp, [esp + 20] ;; func
        mov     ebx, [esp + 24] ;; arg
        ;;
        xor     eax, eax
.warmup:
        add     ecx, eax
        sub     eax, 1
        jnz     .warmup
        ;;
        rdtsc
        mov     esi, eax
        mov     edi, edx
        push    ebx
        call    ebp
        rdtsc
        sub     eax, esi
        sbb     edx, edi
        ;;
        pop     edi
        pop     esi
        pop     ebx
        pop     ebp
        ret     8
    


And even after all of that windows will still schedule and prioritize the way it wants, but the above should be more than close enough. Calling the function you want to test 1000000 times and taking the average will always be the better solution.
Post 24 Jun 2014, 19:50
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
revolution
r22 wrote:
* Execute a warmup loop right before the timing.
I think this is a mistake. A normal program won't do this so why should a benchmark? Doing this will give a false impression of the performance when used in a real situation. Hmm, actually what am I saying, ALL benchmarks give a false impression so this code should only be tested within the target application under real world usage conditions.
Post 24 Jun 2014, 20:09
View user's profile Send private message Visit poster's website Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha
revolution, we need this to use rdtsc. As I wrote before It makes more problems, than an inacurate gettickcount. Now I see what could be the reasons.
Post 24 Jun 2014, 20:17
View user's profile Send private message Reply with quote
baldr



Joined: 19 Mar 2008
Posts: 1651
baldr
r22 wrote:
…Try to…
Execute that code from a primitive bootloader? Wink
Post 24 Jun 2014, 21:13
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@revolution - If you are benchmarking a library function with global utility than there would be quite a large set of real use applications to test it within. I think getting the optimal speed at which a procedure can run is a valid metric for comparing it's performance with other similar procedures. I can't think of a situation *for this specific case (strlen)* where a worst case benchmark would be helpful. Now if the procedure we were benchmarking could possibly have large lookup tables than testing against a CPU loaded cache starved scenario would be a valid metric too.

@baldr - Lets just throw together a 32bit real-time operating system to benchmark our string length optimizations Very Happy Maybe a driver/kernel module would be sufficient, since I'd rather not reboot every time I want to run another benchmark Very Happy
Post 25 Jun 2014, 00:35
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
revolution
r22 wrote:
@revolution - If you are benchmarking a library function with global utility than ...
... we can provide a number of different options for the application programmer to choose from based upon the applications needs. And if the application is not sensitive to 10ns and/or 6 bytes improvement then the application programmer can just choose one option and not care about trying others to see which is "best".

I don't subscribe to the "one solution for all situations" paradigm. There needs to be options because all situations are unique and have their own requirements.
Post 25 Jun 2014, 00:44
View user's profile Send private message Visit poster's website Reply with quote
Sasha



Joined: 17 Nov 2011
Posts: 93
Sasha
r22 wrote:
@baldr - Lets just throw together a 32bit real-time operating system to benchmark our string length optimizations

This operatimg system must be optimized! Btw, there are some good options, like 32 bit dos or... DexOs ?
Post 25 Jun 2014, 01:21
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
There is something not clear with the tested code (FreshLib, Sasha's and r22's) - I tried to split the routine in half and reading only one dword in the inner loop. And the result is almost as fast as reading two dwords at time (10% slower). Regardless of the instructions dependencies and parallel execution.
On the other hand, MMX version of StrLen gives me exactly twice the speed of the dword and qword versions.
What is the reason for such strange behavior?
Post 27 Jun 2014, 08:53
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
BTW, here is the MMX code I tested:

Code:
proc StrLenMMX, .ptrString    ; proc StrLen [hString]
begin
        push    ecx edx
        mov     eax, [.ptrString]

        pxor     mm1, mm1
        mov      ecx, eax
        mov      edx, eax
        and      ecx, -8
        and      eax, 7
        movq     mm0, [ecx]
        por      mm0, [.STRINGTBL+eax*8]

.inner:
        add      ecx, 8
        pcmpeqb  mm0, mm1
        packsswb mm0, mm0
        movd     eax, mm0
        movq     mm0, [ecx]
        test     eax, eax
        jz       .inner

        bsf      eax, eax
        shr      eax, 2
        lea      eax, [ecx+eax-8]
        sub      eax, edx

        emms
        pop     edx ecx
        return

.STRINGTBL dq   0, $ff, $ffff, $ffffff, $ffffffff, $ffffffffff, $ffffffffffff, $ffffffffffffff
endp    


Also, here is the question, is it appropriate to use MMX code in a general purpose library as FreshLib.
Post 27 Jun 2014, 10:44
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
revolution
JohnFound wrote:
Also, here is the question, is it appropriate to use MMX code in a general purpose library as FreshLib.
A philosophical question. I think it depends upon your starting premise. If you say that FL requires an MMX capable CPU, and you also specify the register usage requirements, and define what happens to the FPU state and MMX state then sure, got for it.
Post 27 Jun 2014, 10:54
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
JohnFound wrote:

What is the reason for such strange behavior?


The MMX routine has a more compact inner loop (less instructions less overhead), inside the loop only 1 conditional jump is used, and one 8 byte memory read is more efficient than two 4 byte reads.

I think you should go the full way and have an XMMX 16byte at a time version. Using PCMPEQB + PTEST should keep the inner loop tight.
Post 27 Jun 2014, 12:47
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17287
Location: In your JS exploiting you and your system
revolution
r22 wrote:
I think you should go the full way and have an XMMX 16byte at a time version. Using PCMPEQB + PTEST should keep the inner loop tight.
Teaser!
Post 27 Jun 2014, 13:03
View user's profile Send private message Visit poster's website Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
JohnFound
revolution wrote:
r22 wrote:
I think you should go the full way and have an XMMX 16byte at a time version. Using PCMPEQB + PTEST should keep the inner loop tight.
Teaser!


XMMX=SSE?

I want to support in FreshLib at least 98% (+-) of the CPUs running in the moment. But I think MMX is available on (almost) all processors in use today.
Did someone knows about some statistical data on this subject?

_________________
Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9
Post 27 Jun 2014, 16:12
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@JohnFound - SSE2 has been around since 2001 (Pentium 4) having 13 years of backward compatibility seems reasonable to me.

I couldn't use PTEST (need SSE4 support), but using PMOVMSKB seems to work just fine.
http://ideone.com/A4CwiD

Code:
align 16
strlen_sse:
        mov     ecx, [esp + 4]
        pxor    xmm1, xmm1
        movdqu  xmm0, [ecx]
        mov     edx, ecx
        pcmpeqb xmm0, xmm1
        and     edx, 15
        add     ecx, 16
        pmovmskb eax, xmm0
        and     ecx, -16
        and eax, [STRLENMASK + edx * 4]
        jz .scan
        bsf     eax, eax
        sub     ecx, [esp + 4]
        lea     eax, [ecx + eax - 16]
        ret     4
align 16
.scan:
        movdqa  xmm0, [ecx]
        pcmpeqb xmm0, xmm1
        add     ecx, 16
        pmovmskb        eax, xmm0
        test    eax, eax
        jz      .scan
        bsf     eax, eax
        sub     ecx, [esp + 4]
        lea     eax, [ecx + eax - 16]
        ret 4

align 8
        STRLENMASK      dd      0x0000ffff, 0x00007fff, 0x00003fff, 0x00001fff, 0x00000fff, 0x000007ff, 0x000003ff, 0x000001ff, 0x000000ff, 0x0000007f, 0x0000003f, 0x0000001f, 0x0000000f, 0x00000007, 0x00000003, 0x00000001

    
Post 27 Jun 2014, 17:59
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 18:08; edited 1 time in total
Post 27 Jun 2014, 21:20
View user's profile Send private message Reply with quote
sid123



Joined: 30 Jul 2013
Posts: 340
Location: Asia, Singapore
sid123
The AMD64 and the System Five Application binary interface make huge use of MMX/SSE while passing floating point arguments. So I think it should be safe to assume that almost every 64-bit processor today supports these extensions.
Post 30 Jun 2014, 05:47
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Unvalidated PTEST version.
Code:
align 16 
strlen_sse: 
        mov     ecx, [esp + 4] 
        movdqa  xmm2, [STRLENPTEST]
        pxor    xmm1, xmm1
        mov     edx, ecx 
        movdqu  xmm0, [ecx] 
        add     ecx, 16
        pcmpeqb xmm0, xmm1 
        and     ecx, -16 
        ptest   xmm0, xmm2
        jz      .scan 
        pmovmskb        eax, xmm0 
        sub     ecx, edx
        bsf     eax, eax 
        lea     eax, [ecx + eax - 16] 
        ret     4 
align 16 
.scan: 
        movdqa  xmm0, [ecx] 
        pcmpeqb xmm0, xmm1 
        add     ecx, 16 
        ptest   xmm0, xmm2
        jz      .scan 
        pmovmskb        eax, xmm0 
        sub     ecx, edx
        bsf     eax, eax 
        lea     eax, [ecx + eax - 16] 
        ret     4 
align 16
        STRLENPTEST     dq      0x8080808080808080, 0x8080808080808080 
    


I think for general purpose use this would probably be the fastest. However, for very short strings ( < ~8bytes) it can't compete with the simple 4byte load version.
Post 30 Jun 2014, 13:58
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
HaHaAnonymous



Joined: 02 Dec 2012
Posts: 1180
Location: Unknown
HaHaAnonymous
[ Post removed by author. ]


Last edited by HaHaAnonymous on 28 Feb 2015, 18:07; edited 1 time in total
Post 30 Jun 2014, 15:06
View user's profile Send private message Reply with quote
donn



Joined: 05 Mar 2010
Posts: 178
donn
Is this approach, using scas,

http://www.int80h.org/strlen/

disregarded due to speed? It seems pretty convenient.
Post 16 Jul 2014, 15:30
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5, 6  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on YouTube, Twitter.

Website powered by rwasa.