flat assembler
Message board for the users of flat assembler.
HaHaAnonymous 23 Jun 2014, 00:30
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 18:08; edited 1 time in total
revolution 23 Jun 2014, 00:38
If you want to reduce the byte count you can use "or eax,-1" (3 bytes) instead of "mov eax,-1" (5 bytes).
AsmGuru62 23 Jun 2014, 02:08
Sasha:
Your graphs look nice. I just wanted to know how it compares with the other variants.
r22 23 Jun 2014, 14:10
Here's my crack at strlen.
* The alignment loop is only max 3 iterations, so unroll it.
* No need to 'bytefind' after we know the null is in there, just use BSF.
* I use multiple RET paths rather than trying to mangle everything into one (the trade-off of call->ret pairing vs better opcode throughput seems to be worth it).

Code:
strlen_r22:
        push    ebx
        push    esi
        push    edi
        mov     eax, [esp + 16]
        mov     ebx, -01010101h
        test    eax, 3
        jz      .scan
        mov     edx, [eax]
        test    dl, dl
        jz      .found
        inc     eax
        test    eax, 3
        jz      .scan
        test    dh, dh
        jz      .found
        inc     eax
        shr     edx, 16
        test    eax, 3
        jz      .scan
        test    dl, dl
        jz      .found
        inc     eax
        jmp     .scan

align 16
.scan:
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jz      .scan
.foundhi:
        bsf     edx, edx
        sub     eax, [esp + 16]
        shr     edx, 3
        lea     eax, [eax + edx - 4]
        pop     edi
        pop     esi
        pop     ebx
        ret     4
.foundlo:
        bsf     ecx, ecx
        sub     eax, [esp + 16]
        shr     ecx, 3
        lea     eax, [eax + ecx - 8]
        pop     edi
        pop     esi
        pop     ebx
        ret     4
.found:
        sub     eax, [esp + 16]
        pop     edi
        pop     esi
        pop     ebx
        ret     4

Since ideone.com has a NASM compiler, I created a kludge benchmark there:
http://ideone.com/S6JKdX

Last edited by r22 on 23 Jun 2014, 17:38; edited 1 time in total
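[Editor's note] The lea/not/and sequence in the .scan loop is the classic word-at-a-time zero-byte test: with EBX = -01010101h, "lea ecx,[esi+ebx]" computes v - 01010101h, and ANDing with NOT v and 80808080h sets the high bit of exactly those bytes that were zero. A minimal C sketch of the same idea (the names are illustrative, not from the thread; it assumes reading up to 3 bytes past the terminator is safe, which holds for padded buffers):

```c
#include <stdint.h>
#include <string.h>

/* Zero-byte test behind the lea/not/and sequence above:
   (v - 0x01010101) & ~v & 0x80808080 sets the high bit of
   exactly those bytes of v that are 0x00. */
static uint32_t has_zero_byte(uint32_t v)
{
    return (v - 0x01010101u) & ~v & 0x80808080u;
}

/* Word-at-a-time strlen built on that test.  Once a chunk is
   flagged it falls back to a plain byte scan - the "bytefind"
   that the BSF refinement in the assembly avoids.  memcpy keeps
   the 4-byte load legal at any alignment. */
static size_t strlen_word(const char *s)
{
    const char *p = s;
    for (;;) {
        uint32_t v;
        memcpy(&v, p, 4);       /* one 4-byte chunk */
        if (has_zero_byte(v)) {
            while (*p)          /* pin down the exact byte */
                p++;
            return (size_t)(p - s);
        }
        p += 4;
    }
}
```

The per-byte rescan at the end is exactly what the BSF + SHR trick in the routine above eliminates.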
HaHaAnonymous 23 Jun 2014, 17:09
[ Post removed by author. ]
Last edited by HaHaAnonymous on 28 Feb 2015, 18:08; edited 1 time in total
JohnFound 24 Jun 2014, 09:09
Why cmpsb? Isn't it scasb?
revolution 24 Jun 2014, 09:11
JohnFound wrote: Why cmpsb? Isn't it scasb?
JohnFound 24 Jun 2014, 10:12
revolution wrote: Yes it is. My mistake. Thanks for the correction.

I made some tests on scasb and non-scas byte scanning:

Code:
proc StrLenByte, .ptrString
begin
        mov     eax, [.ptrString]
        dec     eax
.loop:
        inc     eax
        cmp     byte [eax], 0
        jne     .loop
        sub     eax, [.ptrString]
        return
endp

proc StrLenByteS, .ptrString
begin
        push    edi ecx
        mov     edi, [.ptrString]
        or      ecx, -1
        xor     eax, eax
        repne scasb
        lea     eax, [edi-1]
        sub     eax, [.ptrString]
        pop     ecx edi
        return
endp

The results are shown on the graph. The labels are:
1. Sasha - the Sasha algorithm.
2. Dword - the same algorithm, but comparing only dwords.
3. Byte - the above "StrLenByte" procedure.
4. ByteS - the above "StrLenByteS" procedure (scasb).

The X coordinate is the length of the string. The Y coordinate is the time for one execution in [ns] (actually the time for 1000000 executions in [ms], which is the same number). The CPU is an Intel Core i3 @ 3.4GHz, Windows 7 64-bit.
_________________ Tox ID: 48C0321ADDB2FE5F644BB5E3D58B0D58C35E5BCBC81D7CD333633FEDF1047914A534256478D9 |
revolution 24 Jun 2014, 10:17
JohnFound: It would help to know which CPU you are testing with.
Also, are all the strings already in cache before scanning?
JohnFound 24 Jun 2014, 10:28
revolution wrote: JohnFound: It would help to know which CPU you are testing with.

The post is edited and the CPU type added. The benchmark creates an array of 1000 strings of the given length, then calls every procedure 1000 times on every string from the array. In the presented run, all strings are aligned on 16 bytes. For every length (from 16 to 1500) a new array of strings is created.
revolution 24 Jun 2014, 10:31
Okay, so the strings are already in cache and the branch predictor will have an easy time building history and exiting without misprediction.
I wonder how different the times would be for uncached strings of [pseudo-]random length (which is probably a better model for a real world application)?
JohnFound 24 Jun 2014, 10:35
Well, I tried to create an array with 1000000 unique strings (but still of the same length) and the results were pretty much the same, actually. The size of the data (at length 1500) is around 1.5 GBytes, so the strings were guaranteed not to be in the cache.
r22 24 Jun 2014, 12:42
Unrolling the main loop (4x) in my variation (sasha + agner fog) adds further improvement.
Code:
strlen_unrolled:
        push    ebx
        push    esi
        push    edi
        mov     eax, [esp + 16]
        mov     ebx, -01010101h
        test    eax, 3
        jz      .scan
        mov     edx, [eax]
        test    dl, dl
        jz      .found
        inc     eax
        test    eax, 3
        jz      .scan
        test    dh, dh
        jz      .found
        inc     eax
        shr     edx, 16
        test    eax, 3
        jz      .scan
        test    dl, dl
        jz      .found
        inc     eax
        jmp     .scan
.found:
        sub     eax, [esp + 16]
        pop     edi
        pop     esi
        pop     ebx
        ret     4

align 16
.scan:
        ; unrolled copy 1
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jnz     .foundhi
        ; unrolled copy 2
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jnz     .foundhi
        ; unrolled copy 3
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jnz     .foundhi
        ; unrolled copy 4
        mov     esi, [eax]
        mov     edi, [eax + 4]
        add     eax, 8
        lea     ecx, [esi + ebx]
        lea     edx, [edi + ebx]
        not     esi
        not     edi
        and     ecx, esi
        and     edx, edi
        and     ecx, 80808080h
        jnz     .foundlo
        and     edx, 80808080h
        jz      .scan
.foundhi:
        bsf     edx, edx
        sub     eax, [esp + 16]
        shr     edx, 3
        lea     eax, [eax + edx - 4]
        pop     edi
        pop     esi
        pop     ebx
        ret     4
.foundlo:
        bsf     ecx, ecx
        sub     eax, [esp + 16]
        shr     ecx, 3
        lea     eax, [eax + ecx - 8]
        pop     edi
        pop     esi
        pop     ebx
        ret     4

http://ideone.com/XyVmKH

Depending on the average size you're testing for, and if you can avoid poking over a memory boundary, using SIMD would probably be faster. The same algorithm would apply: use an unaligned load for the alignment part, then aligned loads (MOVDQA) for the main loop, PTEST for the comparison, and PMOVMSKB + BSF when found.
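[Editor's note] r22's SIMD outline can be sketched with SSE2 intrinsics. This is only an illustration, not code from the thread, and it substitutes PCMPEQB + PMOVMSKB for the PTEST step (PTEST is SSE4.1 and only gives a yes/no answer, so it cannot locate the byte by itself). It assumes the buffer is padded so that 16-byte reads past the terminator cannot fault:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* SSE2 strlen sketch: one unaligned 16-byte load to cover the
   head, then aligned loads (MOVDQA) in the main loop.  PCMPEQB
   against zero + PMOVMSKB yields one bit per byte; a trailing-
   zero count (BSF) locates the terminator. */
static size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* Unaligned head chunk (may overlap the first aligned chunk). */
    __m128i v = _mm_loadu_si128((const __m128i *)s);
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
    if (mask)
        return (size_t)__builtin_ctz(mask);

    /* Aligned main loop; no gap, since the head covered s..s+15. */
    const char *p = (const char *)(((uintptr_t)s + 16) & ~(uintptr_t)15);
    for (;;) {
        v = _mm_load_si128((const __m128i *)p);
        mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```

The overlap between the unaligned head and the first aligned chunk is harmless: if the terminator is in the overlap, the head load already returned.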
Sasha 24 Jun 2014, 15:30
JohnFound, how do you count the time? I used GetTickCount and a very large number of executions, so it was problematic to test the big ranges. Now I'm porting to rdtsc. It is very accurate, so it creates even more problems. But now, on the large range, I see the performance gain from unrolling the loop, which I didn't see before. It begins after a length of 512 bytes on my computer. Also, I used the same buffer for all tests; now I'll try to create a separate buffer for every execution.

r22, the bsf is the real gem!
r22 24 Jun 2014, 15:58
The BSF + SHR usage is from Agner Fog's optimized strlen routine.

If you check the IDEONE link from my last post and look at the "benchmark" and "makestring" procedures, they seem to work pretty well. I create 4 copies of the same string with different alignments so that the alignment portion of the code is accounted for. But I think for a big optimized strlen you would have aligned buffers already, so maybe the pre-alignment stuff is out of scope.
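[Editor's note] In C terms the BSF + SHR pair is a count-trailing-zeros followed by a divide by 8. A tiny sketch (the helper name is illustrative):

```c
#include <stdint.h>

/* BSF returns the index of the lowest set bit; compilers emit it
   (or TZCNT) for __builtin_ctz.  With an 80808080h-style mask the
   flag for byte n sits at bit 8n+7, so bit_index >> 3 (the SHR in
   the routine above) recovers n, the offset of the first zero byte
   in the flagged dword.  Mask must be nonzero. */
static unsigned first_zero_byte_index(uint32_t mask)
{
    return (unsigned)__builtin_ctz(mask) >> 3;
}
```

This is why no second per-byte scan ("bytefind") is needed once a dword is known to contain the terminator.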
AsmGuru62 24 Jun 2014, 16:53
QueryPerformanceCounter() is probably better than GetTickCount().
JohnFound 24 Jun 2014, 16:56
@Sasha: I am counting the time using the FreshLib GetTimestamp procedure, which on Windows uses GetTickCount and on Linux sys_gettimeofday.

Anyway, as long as the test runs long enough (1000000 executions for every string length and function), the overall CPU load influences the measured values, making the test close to real life. Actually, I am not a fan of optimizing to the last extent, but some of the results are really strange to me. Later, I will publish some results from other CPUs (Intel Atom and AMD Sempron) that are even stranger.
Sasha 24 Jun 2014, 17:17
With rdtsc even a single execution can be tested. Now I've made a benchmarker that counts the rdtsc cycles of a single execution and, most importantly, takes a new string and a new instance of the code for every test. I called it a "cold benchmark" because of that. Look at the code. Some macros need to be created, if such an approach is worth it.

This is the output on my Core 2 Duo 2.2 GHz:
3927 781 154 550 473 473 451 143 176 176 176 165 528 165 176 352 220 187 220 176 187 198 198 231 242 209 231 253 231 242 165 264

The first number is always much bigger than the rest.
Sasha 24 Jun 2014, 17:54
JohnFound wrote: Actually I am not a fan of optimizing to the last extent.

Yes, and after all this we still don't know which is better. In real life this function may be called very rarely, and if so, the size-optimized version can be better. The only solution that I see is to test the code in real workloads. For example, a compiler can be tested by compiling real code.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.