flat assembler
Message board for the users of flat assembler.
Index
> Main > String length Goto page Previous 1, 2, 3, 4, 5, 6 Next |
Author |
|
r22 24 Jun 2014, 19:50
@sasha if you are trying to do fine grained timings with RDTSC you will need to do the following:
* Disable turbo boost (usual C-state and power settings in your BIOS) * Try to force the process to the highest priority class * Try to force the executing thread to be scheduled on a single cpu core * Try to force the executing thread to the highest priority level * Execute a warmup loop right before the timing. This tries to force your code into a single scheduled time slice and enable turbo boost/cpu frequency increases On windows something like this... Code: _start: ;; process priority class invoke GetCurrentProcess push 0x100 ;;REALTIME_PRIORITY_CLASS push eax invoke SetPriorityClass ;; thread priority invoke GetCurrentThread push 15 ;;THREAD_PRIORITY_TIME_CRITICAL push eax invoke SetThreadPriority ;; thread cpu core affinity invoke GetCurrentThread push 2 ;;use CPU Core 2 push eax invoke SetThreadAffinityMask ... push strbuf push strlen_optimized call TimeSingleCall ... TimeSingleCall: ;; (func, arg) push ebp push ebx push esi push edi mov ebp, [esp + 20] ;; func mov ebx, [esp + 24] ;; arg ;; xor eax, eax .warmup: add ecx, eax sub eax, 1 jnz .warmup ;; rdtsc mov esi, eax mov edi, edx push ebx call ebp rdtsc sub eax, esi sbb edx, edi ;; pop edi pop esi pop ebx pop ebp ret 8 And even after all of that windows will still schedule and prioritize the way it wants, but the above should be more than close enough. Calling the function you want to test 1000000 times and taking the average will always be the better solution. |
|||
24 Jun 2014, 19:50 |
|
revolution 24 Jun 2014, 20:09
r22 wrote: * Execute a warmup loop right before the timing. |
|||
24 Jun 2014, 20:09 |
|
Sasha 24 Jun 2014, 20:17
revolution, we need this to use rdtsc. As I wrote before It makes more problems, than an inacurate gettickcount. Now I see what could be the reasons.
|
|||
24 Jun 2014, 20:17 |
|
baldr 24 Jun 2014, 21:13
r22 wrote: …Try to… |
|||
24 Jun 2014, 21:13 |
|
r22 25 Jun 2014, 00:35
@revolution - If you are benchmarking a library function with global utility than there would be quite a large set of real use applications to test it within. I think getting the optimal speed at which a procedure can run is a valid metric for comparing it's performance with other similar procedures. I can't think of a situation *for this specific case (strlen)* where a worst case benchmark would be helpful. Now if the procedure we were benchmarking could possibly have large lookup tables than testing against a CPU loaded cache starved scenario would be a valid metric too.
@baldr - Lets just throw together a 32bit real-time operating system to benchmark our string length optimizations Maybe a driver/kernel module would be sufficient, since I'd rather not reboot every time I want to run another benchmark |
|||
25 Jun 2014, 00:35 |
|
revolution 25 Jun 2014, 00:44
r22 wrote: @revolution - If you are benchmarking a library function with global utility than ... I don't subscribe to the "one solution for all situations" paradigm. There needs to be options because all situations are unique and have their own requirements. |
|||
25 Jun 2014, 00:44 |
|
Sasha 25 Jun 2014, 01:21
r22 wrote: @baldr - Lets just throw together a 32bit real-time operating system to benchmark our string length optimizations This operatimg system must be optimized! Btw, there are some good options, like 32 bit dos or... DexOs ? |
|||
25 Jun 2014, 01:21 |
|
JohnFound 27 Jun 2014, 08:53
There is something not clear with the tested code (FreshLib, Sasha's and r22's) - I tried to split the routine in half and reading only one dword in the inner loop. And the result is almost as fast as reading two dwords at time (10% slower). Regardless of the instructions dependencies and parallel execution.
On the other hand, MMX version of StrLen gives me exactly twice the speed of the dword and qword versions. What is the reason for such strange behavior? |
|||
27 Jun 2014, 08:53 |
|
JohnFound 27 Jun 2014, 10:44
BTW, here is the MMX code I tested:
Code: proc StrLenMMX, .ptrString ; proc StrLen [hString] begin push ecx edx mov eax, [.ptrString] pxor mm1, mm1 mov ecx, eax mov edx, eax and ecx, -8 and eax, 7 movq mm0, [ecx] por mm0, [.STRINGTBL+eax*8] .inner: add ecx, 8 pcmpeqb mm0, mm1 packsswb mm0, mm0 movd eax, mm0 movq mm0, [ecx] test eax, eax jz .inner bsf eax, eax shr eax, 2 lea eax, [ecx+eax-8] sub eax, edx emms pop edx ecx return .STRINGTBL dq 0, $ff, $ffff, $ffffff, $ffffffff, $ffffffffff, $ffffffffffff, $ffffffffffffff endp Also, here is the question, is it appropriate to use MMX code in a general purpose library as FreshLib. |
|||
27 Jun 2014, 10:44 |
|
revolution 27 Jun 2014, 10:54
JohnFound wrote: Also, here is the question, is it appropriate to use MMX code in a general purpose library as FreshLib. |
|||
27 Jun 2014, 10:54 |
|
r22 27 Jun 2014, 12:47
JohnFound wrote:
The MMX routine has a more compact inner loop (less instructions less overhead), inside the loop only 1 conditional jump is used, and one 8 byte memory read is more efficient than two 4 byte reads. I think you should go the full way and have an XMMX 16byte at a time version. Using PCMPEQB + PTEST should keep the inner loop tight. |
|||
27 Jun 2014, 12:47 |
|
revolution 27 Jun 2014, 13:03
r22 wrote: I think you should go the full way and have an XMMX 16byte at a time version. Using PCMPEQB + PTEST should keep the inner loop tight. |
|||
27 Jun 2014, 13:03 |
|
JohnFound 27 Jun 2014, 16:12
revolution wrote:
XMMX=SSE? I want to support in FreshLib at least 98% (+-) of the CPUs running in the moment. But I think MMX is available on (almost) all processors in use today. Did someone knows about some statistical data on this subject? _________________ Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9 |
|||
27 Jun 2014, 16:12 |
|
r22 27 Jun 2014, 17:59
@JohnFound - SSE2 has been around since 2001 (Pentium 4) having 13 years of backward compatibility seems reasonable to me.
I couldn't use PTEST (need SSE4 support), but using PMOVMSKB seems to work just fine. http://ideone.com/A4CwiD Code: align 16 strlen_sse: mov ecx, [esp + 4] pxor xmm1, xmm1 movdqu xmm0, [ecx] mov edx, ecx pcmpeqb xmm0, xmm1 and edx, 15 add ecx, 16 pmovmskb eax, xmm0 and ecx, -16 and eax, [STRLENMASK + edx * 4] jz .scan bsf eax, eax sub ecx, [esp + 4] lea eax, [ecx + eax - 16] ret 4 align 16 .scan: movdqa xmm0, [ecx] pcmpeqb xmm0, xmm1 add ecx, 16 pmovmskb eax, xmm0 test eax, eax jz .scan bsf eax, eax sub ecx, [esp + 4] lea eax, [ecx + eax - 16] ret 4 align 8 STRLENMASK dd 0x0000ffff, 0x00007fff, 0x00003fff, 0x00001fff, 0x00000fff, 0x000007ff, 0x000003ff, 0x000001ff, 0x000000ff, 0x0000007f, 0x0000003f, 0x0000001f, 0x0000000f, 0x00000007, 0x00000003, 0x00000001 |
|||
27 Jun 2014, 17:59 |
|
HaHaAnonymous 27 Jun 2014, 21:20
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 18:08; edited 1 time in total |
|||
27 Jun 2014, 21:20 |
|
sid123 30 Jun 2014, 05:47
The AMD64 and the System Five Application binary interface make huge use of MMX/SSE while passing floating point arguments. So I think it should be safe to assume that almost every 64-bit processor today supports these extensions.
|
|||
30 Jun 2014, 05:47 |
|
r22 30 Jun 2014, 13:58
Unvalidated PTEST version.
Code: align 16 strlen_sse: mov ecx, [esp + 4] movdqa xmm2, [STRLENPTEST] pxor xmm1, xmm1 mov edx, ecx movdqu xmm0, [ecx] add ecx, 16 pcmpeqb xmm0, xmm1 and ecx, -16 ptest xmm0, xmm2 jz .scan pmovmskb eax, xmm0 sub ecx, edx bsf eax, eax lea eax, [ecx + eax - 16] ret 4 align 16 .scan: movdqa xmm0, [ecx] pcmpeqb xmm0, xmm1 add ecx, 16 ptest xmm0, xmm2 jz .scan pmovmskb eax, xmm0 sub ecx, edx bsf eax, eax lea eax, [ecx + eax - 16] ret 4 align 16 STRLENPTEST dq 0x8080808080808080, 0x8080808080808080 I think for general purpose use this would probably be the fastest. However, for very short strings ( < ~8bytes) it can't compete with the simple 4byte load version. |
|||
30 Jun 2014, 13:58 |
|
HaHaAnonymous 30 Jun 2014, 15:06
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 18:07; edited 1 time in total |
|||
30 Jun 2014, 15:06 |
|
donn 16 Jul 2014, 15:30
Is this approach, using scas,
http://www.int80h.org/strlen/ disregarded due to speed? It seems pretty convenient. |
|||
16 Jul 2014, 15:30 |
|
Goto page Previous 1, 2, 3, 4, 5, 6 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.