flat assembler
Message board for the users of flat assembler.
bitRAKE 19 Aug 2022, 04:11
LAHF and we get the full branch potential.
(Or a single bit shift.)
Last edited by bitRAKE on 19 Aug 2022, 04:29; edited 1 time in total
revolution 19 Aug 2022, 04:15
I got these four new variants tested on the same box as before.
Code:
the_data_pattern_is_unimportant:
        while % <= 1000
                db 1
        end while
        db 2,3,4
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant
Code:
~ fasm -d TEST=four_way_junction test.asm && time ./test
flat assembler version 1.73.08 (4042780 kilobytes memory)
2 passes, 1133 bytes.
real 0m7.084s
user 0m6.976s
sys 0m0.016s
~ fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
flat assembler version 1.73.08 (4042332 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.836s
user 0m6.556s
sys 0m0.004s
~ fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
flat assembler version 1.73.08 (4032536 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.561s
user 0m6.412s
sys 0m0.004s
~ fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
flat assembler version 1.73.08 (4048760 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.483s
user 0m6.408s
sys 0m0.000s
Code:
the_data_pattern_is_unimportant:
        while % <= 1000
                db 4
        end while
        db 2,3,1
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant
Code:
~ fasm -d TEST=four_way_junction test.asm && time ./test
flat assembler version 1.73.08 (4032044 kilobytes memory)
2 passes, 1133 bytes.
real 0m5.167s
user 0m5.092s
sys 0m0.004s
~ fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
flat assembler version 1.73.08 (4043256 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.823s
user 0m6.776s
sys 0m0.000s
~ fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
flat assembler version 1.73.08 (4026552 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.892s
user 0m6.768s
sys 0m0.016s
~ fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
flat assembler version 1.73.08 (4042588 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.873s
user 0m6.784s
sys 0m0.000s
Tomasz Grysztar 19 Aug 2022, 07:39
revolution wrote: None of those beat the "useless_cmp_chain". And again the times change depending upon the data pattern.
revolution 19 Aug 2022, 07:48
I do hope that everyone realises that these synthetic tests are only valid for the actual code shown, on the actual system it ran on, with the actual data used. That is, they shouldn't be used to make decisions on some other code, on some other system, on different data.
It would be instructive to have results from the same code and data above, but on different systems. I expect results to be all over the place, with "faster" code becoming "slower" on some other systems.
revolution 19 Aug 2022, 08:15
I notice that the "fastest" code is also the largest.
Perhaps we should optimise for anti-size, and then get the fastest.
FlierMate1 19 Aug 2022, 08:34
Tomasz Grysztar wrote: Side-stepping the topic a little, this could be made into a fun 4-way junction:
Nice!!!
JA = jump if above 3, i.e. 4
JE = jump if equal to 3
JPE = jump if parity even, in this case 2
JMP = unconditional jump, taken when the value is 1
Tomasz Grysztar 19 Aug 2022, 08:58
FlierMate1 wrote: JPE=Jump if parity even, in this case, 2
CMP is the same as SUB, except that it updates only the flags and does not write the result to the destination register. If we subtract 3 from 2, we get -1, which is 11111111b and has an even number of 1s. This lets us distinguish it from 1 - 3 = -2, which is 11111110b and has an odd number of 1s.
revolution wrote: I do hope that everyone realises that these synthetic tests are only valid for the actual code shown, on the actual system it ran on, with the actual data used. That is, they shouldn't be used to make decisions on some other code, on some other system, on different data.
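The parity argument above can be checked with a small model. This is only a sketch in Python, not the junction code itself (which is not quoted on this page): `cmp8_flags` and `junction` are hypothetical helper names, the branch order JA, JE, JPE, JMP is taken from FlierMate1's summary, and the labels item1 to item4 are borrowed from the macros posted in this thread.

```python
def cmp8_flags(a, b):
    """Model the ZF, CF and PF results of an 8-bit `cmp a, b` (a - b, result discarded)."""
    result = (a - b) & 0xFF                      # low byte of the subtraction
    return {
        "ZF": result == 0,                       # zero flag: operands were equal
        "CF": a < b,                             # carry flag: unsigned borrow
        "PF": bin(result).count("1") % 2 == 0,   # parity flag: even number of 1 bits
    }


def junction(value):
    """Assumed branch order after `cmp al,3`: JA, JE, JPE, then a plain JMP."""
    flags = cmp8_flags(value, 3)
    if not flags["CF"] and not flags["ZF"]:      # JA: above 3, so 4
        return "item4"
    if flags["ZF"]:                              # JE: equal to 3
        return "item3"
    if flags["PF"]:                              # JPE: 2 - 3 = 11111111b, eight 1 bits
        return "item2"
    return "item1"                               # JMP: 1 - 3 = 11111110b, seven 1 bits


if __name__ == "__main__":
    for value in (1, 2, 3, 4):
        print(value, junction(value), cmp8_flags(value, 3))
```

The model only covers the inputs 1 to 4 used in the tests; other byte values would fall into whichever branch their flags happen to select.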
Overclick 19 Aug 2022, 09:40
Code:
        jmp [adritem0+rax*8]
adritem0 dq item0
        dq item1
        dq item2
        dq item3
        dq item4
        dq item_n
        ...
item0:
        ...
item1:
        ...
item2:
        ...
item3:
        ...
item4:
        ...
Code:
        shl rax,5
        add rax,item0
        jmp rax
        align 32
item0:
        ...
        align 32
item1:
        ...
        align 32
item2:
        ...
        align 32
item3:
        ...
        align 32
item4:
        ...
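The two forms above differ in where the target address comes from: the first loads it from a table, the second computes it as item0 + index*32. The following Python sketch models only that address arithmetic. The helper names, the base address 0x1000 and the handler sizes are made up for illustration, and the model assumes item0 itself is 32-byte aligned.

```python
STRIDE = 32  # matches "align 32" and "shl rax,5" (2**5 == 32)


def place_handlers(base, sizes):
    """Lay handlers out in order, aligning each start address to STRIDE bytes."""
    addresses = []
    cursor = base
    for size in sizes:
        cursor = (cursor + STRIDE - 1) & ~(STRIDE - 1)  # effect of "align 32"
        addresses.append(cursor)
        cursor += size                                  # the handler body follows
    return addresses


def table_target(addresses, index):
    """First form: jmp [adritem0+rax*8] reads the target from a table."""
    return addresses[index]


def stride_target(base, index):
    """Second form: shl rax,5 / add rax,item0 / jmp rax."""
    return base + (index << 5)


if __name__ == "__main__":
    base = 0x1000                                  # hypothetical address of item0
    small = place_handlers(base, [10, 20, 32, 5])  # every handler fits in 32 bytes
    big = place_handlers(base, [10, 40, 5])        # the second handler is 40 bytes
    print([hex(a) for a in small], [hex(stride_target(base, i)) for i in range(4)])
    print([hex(a) for a in big], [hex(stride_target(base, i)) for i in range(3)])
```

In this model the computed form lands on the right label only while every handler fits in 32 bytes. Once one handler grows past that, the labels after it shift and only the table form still finds them.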
FlierMate1 19 Aug 2022, 11:44
I tested a C# switch statement on the Compiler Explorer website, and I found that .NET 6.0.101 uses a lookup table (LUT?) to evaluate the value.
Code:
grade=Console.ReadLine()[0];
switch (grade)
{
    case 'A':
        Console.WriteLine("Excellent!");
        break;
    case 'B':
    case 'C':
        Console.WriteLine("Well done");
        break;
    case 'D':
        Console.WriteLine("You passed");
        break;
    case 'F':
        Console.WriteLine("Better try again");
        break;
    default:
        Console.WriteLine("Invalid grade");
        break;
}
The HLL code above is translated to the following:
Code:
G_M21826_IG02:
        ...
        ...
        mov edi, edi
        lea rax, [reloc @RWD00]
        mov eax, dword ptr [rax+4*rdi]
        lea rsi, G_M21826_IG02
        add rax, rsi
        jmp rax
G_M21826_IG03:
        ....
G_M21826_IG04:
        ....
G_M21826_IG05:
        ....
G_M21826_IG06:
        ....
G_M21826_IG07:
        ....
RWD00   dd G_M21826_IG03 - G_M21826_IG02
        dd G_M21826_IG04 - G_M21826_IG02 ;grade=B?
        dd G_M21826_IG04 - G_M21826_IG02 ;grade=C?
        dd G_M21826_IG05 - G_M21826_IG02
        dd G_M21826_IG07 - G_M21826_IG02
        dd G_M21826_IG06 - G_M21826_IG02
Oh, it doesn't use repetitive CMP for each "switch case".
If you want to look at the full disassembly, click here: https://godbolt.org/z/xbW5GWvcq
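The listing above stores 32-bit offsets relative to G_M21826_IG02 and adds the base back before jumping. Here is a rough Python model of that dispatch. Several things in it are assumptions: the label addresses are invented, the elided instructions are assumed to subtract 'A' and range-check the index, and the pairing of IG03 to IG07 with the five messages is inferred from the table order (only B and C are annotated in the post).

```python
# Hypothetical addresses for the JIT labels; only their differences matter.
BASE = 0x1000  # stands in for G_M21826_IG02
LABELS = {"IG03": 0x1040, "IG04": 0x1058, "IG05": 0x1070,
          "IG06": 0x1088, "IG07": 0x10A0}

# Assumed pairing of labels with the C# case bodies.
MESSAGES = {"IG03": "Excellent!", "IG04": "Well done", "IG05": "You passed",
            "IG06": "Better try again", "IG07": "Invalid grade"}

# RWD00: one offset per value 'A'..'F', each relative to BASE.
RWD00 = [LABELS[name] - BASE
         for name in ("IG03", "IG04", "IG04", "IG05", "IG07", "IG06")]

ADDRESS_TO_LABEL = {address: name for name, address in LABELS.items()}


def dispatch(grade):
    """Model of the table jump; the index computation and range check are assumptions."""
    index = ord(grade) - ord("A")         # assumed to happen in the elided instructions
    if not 0 <= index < len(RWD00):       # assumed bounds check that selects the default
        return MESSAGES["IG07"]
    target = BASE + RWD00[index]          # mov eax,[rax+4*rdi] / add rax,rsi
    return MESSAGES[ADDRESS_TO_LABEL[target]]  # jmp rax


if __name__ == "__main__":
    for grade in "ABCDEFZ":
        print(grade, dispatch(grade))
```

Note that 'E' has no case of its own, so its table slot points at the default block. That is how a dense table covers gaps in the case values.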
revolution 19 Aug 2022, 12:10
Three more meaningless data points. But still can't beat the useless_cmp_chain. What am I doing wrong? I can't find the "obvious" solution here.
Code:
macro cmov_for_the_win {
        cmp al,1
        mov edx,item1
        cmovz edi,edx
        cmp al,2
        mov edx,item2
        cmovz edi,edx
        cmp al,3
        mov edx,item3
        cmovz edi,edx
        cmp al,4
        mov edx,item4
        cmovz edi,edx
        jmp edi
}
macro LUTs_are_awesome {
        movzx eax,al
        jmp dword[LUT+(eax-1)*4]
        align 4
        LUT: dd item1,item2,item3,item4
}
macro computed_goto {
        movzx eax,al
        lea eax,[(eax-1)*(item2-item1)+item1]
        jmp eax
}
Code:
~ fasm -d TEST=cmov_for_the_win test.asm && time ./test
flat assembler version 1.73.08 (4044352 kilobytes memory)
2 passes, 1165 bytes.
real 0m7.130s
user 0m7.012s
sys 0m0.004s
~ fasm -d TEST=LUTs_are_awesome test.asm && time ./test
flat assembler version 1.73.08 (4036084 kilobytes memory)
2 passes, 1151 bytes.
real 0m6.460s
user 0m6.376s
sys 0m0.000s
~ fasm -d TEST=computed_goto test.asm && time ./test
flat assembler version 1.73.08 (4045368 kilobytes memory)
3 passes, 1135 bytes.
real 0m6.427s
user 0m6.352s
sys 0m0.000s
Furs 19 Aug 2022, 13:15
revolution wrote: Show your improvement.
Code:
        cmp al,3
        jb @f
        je item3
item4:
        jmp next_item
@@:
        cmp al,1
        je item1
item2:
        jmp next_item
BTW, I purposely made it take more jumps for the 1-2 case; if I wanted to optimize it for that pattern I'd do it the other way around.
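To see what "more jumps for the 1-2 case" means, the compare tree above can be traced by hand. This Python sketch is only a tally of executed CMP and conditional-jump instructions per input value, with a hypothetical helper name `furs_tree`. It says nothing about timing, which also depends on branch prediction and on the data.

```python
def furs_tree(al):
    """Trace the two-level compare tree for al in 1..4.

    Returns (label reached, CMP instructions executed, conditional jumps executed).
    """
    cmps, jccs = 1, 1          # cmp al,3 / jb @f
    if al < 3:                 # jb taken: continue at @@
        cmps += 1              # cmp al,1
        jccs += 1              # je item1
        return ("item1" if al == 1 else "item2", cmps, jccs)
    jccs += 1                  # je item3
    return ("item3" if al == 3 else "item4", cmps, jccs)


if __name__ == "__main__":
    for value in (1, 2, 3, 4):
        print(value, furs_tree(value))
```

Values 3 and 4 are resolved after a single CMP, while 1 and 2 need a second CMP behind a taken JB, which matches the remark about the 1-2 case.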
revolution 19 Aug 2022, 13:28
Thanks. I will pass that on to be tested.
Furs wrote: ... if I wanted to optimize it for that pattern I'd do it the other way around.
bitRAKE 19 Aug 2022, 13:34
revolution wrote: Perhaps we should optimise for anti-size, and then get the fastest.
* In the scope of all problem spaces.
Furs 20 Aug 2022, 16:05
revolution wrote: Thanks. I will pass that on to be tested.
In fact, I pointed out exceptions where binary search is worse (e.g. I even said to use linear search in most cases with up to 4 choices), but the problem is that if your data is skewed, you should really be using something else, like RLE (run-length encoding), where you loop over repeated data, which will be much faster (and take much less space to encode, too). So, linear search is bad except when the data's statistical distribution is mostly random and you have fewer than 5 choices or so (a rule of thumb, not exact, but it's around there).
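For readers unfamiliar with RLE, here is a minimal Python sketch of the idea mentioned above, applied to the two data patterns used in this thread's tests (1000 repeats of one value followed by three others, and 250 repeats of 1,2,3,4). The function names are illustrative, and this is not code anyone in the thread posted or timed.

```python
from itertools import groupby


def rle_encode(data):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(data)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original byte string."""
    return bytes(value for value, count in pairs for _ in range(count))


if __name__ == "__main__":
    skewed = bytes([1] * 1000 + [2, 3, 4])    # first test pattern in the thread
    uniform = bytes([1, 2, 3, 4] * 250)       # the "normalised" pattern
    print(rle_encode(skewed))                 # four pairs describe 1003 bytes
    print(len(rle_encode(uniform)))           # no runs to merge: one pair per byte
```

The skewed pattern collapses to four pairs, so a consumer could dispatch once per run. The uniform pattern gains nothing, because every byte starts a new run.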
bitRAKE 20 Aug 2022, 17:26
Henri Poincaré, 1899 wrote: Logic sometimes makes monsters. For half a century we have seen a mass of bizarre functions which appear to be forced to resemble as little as possible honest functions which serve some purpose. More of continuity, or less of continuity, more derivatives, and so forth. Indeed, from the point of view of logic, these strange functions are the most general; on the other hand those which one meets without searching for them, and which follow simple laws appear as a particular case which does not amount to more than a small corner.
revolution 20 Aug 2022, 22:46
Furs wrote: ... it matters if it's extremely skewed.
revolution 21 Aug 2022, 02:38
My colleague came back to me and she commented:
Quote: Why does Furs keep moving the goalposts?
revolution 21 Aug 2022, 04:01
We couldn't decode Furs' statement about "skewed"; it appears to be a subjective assessment, so we tried to normalise the data.
Code:
the_data_pattern_is_unimportant:
        while % <= 250
                db 1,2,3,4
        end while
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant
Code:
~ fasm -d TEST=uselss_cmp_chain test.asm && time ./test
flat assembler version 1.73.08 (4053804 kilobytes memory)
2 passes, 1139 bytes.
real 0m4.551s
user 0m4.460s
sys 0m0.008s
~ fasm -d TEST=objectively_better_binary test.asm && time ./test
flat assembler version 1.73.08 (4040788 kilobytes memory)
2 passes, 1137 bytes.
real 0m5.197s
user 0m5.112s
sys 0m0.000s
~ fasm -d TEST=four_way_junction test.asm && time ./test
flat assembler version 1.73.08 (4037252 kilobytes memory)
2 passes, 1133 bytes.
real 0m7.139s
user 0m7.004s
sys 0m0.000s
~ fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
flat assembler version 1.73.08 (4022056 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.521s
user 0m6.416s
sys 0m0.000s
~ fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
flat assembler version 1.73.08 (4053372 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.505s
user 0m6.416s
sys 0m0.000s
~ fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
flat assembler version 1.73.08 (4028416 kilobytes memory)
2 passes, 1133 bytes.
real 0m6.508s
user 0m6.424s
sys 0m0.004s
~ fasm -d TEST=cmov_for_the_win test.asm && time ./test
flat assembler version 1.73.08 (4038796 kilobytes memory)
2 passes, 1165 bytes.
real 0m7.195s
user 0m7.016s
sys 0m0.004s
~ fasm -d TEST=LUTs_are_awesome test.asm && time ./test
flat assembler version 1.73.08 (4025676 kilobytes memory)
2 passes, 1151 bytes.
real 0m6.490s
user 0m6.364s
sys 0m0.004s
~ fasm -d TEST=computed_goto test.asm && time ./test
flat assembler version 1.73.08 (4038980 kilobytes memory)
3 passes, 1135 bytes.
real 0m6.467s
user 0m6.352s
sys 0m0.004s
~ fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
flat assembler version 1.73.08 (4053596 kilobytes memory)
2 passes, 1137 bytes.
real 0m7.146s
user 0m7.008s
sys 0m0.004s
Code:
macro uber_no_need_to_test_code_from_Furs {
        cmp al,3
        jb @f
        je item3
item4x:
        jmp next_item
@@:
        cmp al,1
        je item1
item2x:
        jmp next_item
}
Furs wrote: Some things are just obvious.
macomics 21 Aug 2022, 06:17
Could you attach a file with all the macros that were tested?
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.