flat assembler
Message board for the users of flat assembler.
256 possible order. What is the fastest code?
Furs 21 Aug 2022, 14:08
revolution wrote: My colleague came back to me and she commented: Second post in this thread: https://board.flatassembler.net/topic.php?p=224292#224292 (skewed data)

So I'm moving the goal posts I stated originally? Tell your colleague that she won the retarded troll award from me. Anyway, I cannot take your test seriously, either because you screwed something up (e.g. power governor, background tasks), or because the whole workload is a jmp, which obviously skews the result. There is no way that a fall-through performs worse than extra jmps; or it's alignment or something, in which case yours is better only by "luck". Period.
revolution 22 Aug 2022, 09:14
macomics wrote: Could you attach all the macros that were tested with a file?

Code:
rept 0 {
        fasm -d TEST=uselss_cmp_chain test.asm && time ./test
        fasm -d TEST=objectively_better_binary test.asm && time ./test
        fasm -d TEST=four_way_junction test.asm && time ./test
        fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
        fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
        fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
        fasm -d TEST=cmov_for_the_win test.asm && time ./test
        fasm -d TEST=LUTs_are_awesome test.asm && time ./test
        fasm -d TEST=computed_goto test.asm && time ./test
        fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
}

macro uselss_cmp_chain {
        cmp al,1
        jz item1
        cmp al,2
        jz item2
        cmp al,3
        jz item3
        cmp al,4
        jz item4
}

macro objectively_better_binary {
        cmp al,2
        ja high_half
        jb item1
        jmp item2
    high_half:
        cmp al,3
        je item3
        jmp item4
}

macro four_way_junction {
        ; assume AL is in range 1-4
        cmp al,3
        ja item4
        je item3
        jpe item2
        jmp item1
}

macro hack_1234_chain_1 {
        sub al,3 ;add al,-3
        jz item3
        jp item2
        js item1
        jmp item4
}

macro hack_1234_chain_2 {
        add al,-3
        jz item3
        jp item2
        ja item1
        jmp item4
}

macro hack_1234_chain_3 {
        sub al,3
        jz item3
        jp item2
        jb item1
        jmp item4
}

macro cmov_for_the_win {
        cmp al,1
        mov edx,item1
        cmovz edi,edx
        cmp al,2
        mov edx,item2
        cmovz edi,edx
        cmp al,3
        mov edx,item3
        cmovz edi,edx
        cmp al,4
        mov edx,item4
        cmovz edi,edx
        jmp edi
}

macro LUTs_are_awesome {
        movzx eax,al
        jmp dword[LUT+(eax-1)*4]
        align 4
    LUT:
        dd item1,item2,item3,item4
}

macro computed_goto {
        movzx eax,al
        lea eax,[(eax-1)*(item2-item1)+item1]
        jmp eax
}

macro uber_no_need_to_test_code_from_Furs {
        cmp al,3
        jb @f
        je item3
    item4x:
        jmp next_item
    @@:
        cmp al,1
        je item1
    item2x:
        jmp next_item
}

format elf executable

SYS_EXIT = 1

        mov ecx,1000000
outer_loop:
        mov ebx,the_data_pattern_is_unimportant_len
        mov esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST {x}
next_item:
        dec ebx
        jnz inner_loop
        loop outer_loop
        mov eax,SYS_EXIT
        int 0x80

item1:  jmp next_item
item2:  jmp next_item
item3:  jmp next_item
item4:  jmp next_item

the_data_pattern_is_unimportant:
        while % <= 1000
                db 1
        end while
        db 2,3,4
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant
macomics 22 Aug 2022, 10:43
Thanks
ADD:

left: identical to the test program in revolution's post above (the item1..item4 stubs come after the loop, so the dispatch jumps forward).

right: the same rept 0 command list and macros, with the main program rearranged so the item stubs come before the loop (backward jumps):

Code:
format elf executable

SYS_EXIT = 1

        jmp skip_jumps

item1:  jmp next_item
item2:  jmp next_item
item3:  jmp next_item
item4:  jmp next_item

skip_jumps:
        mov ecx,10000000
outer_loop:
        mov ebx,the_data_pattern_is_unimportant_len
        mov esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST { x }
next_item:
        dec ebx
        jnz inner_loop
        loop outer_loop
        mov eax,SYS_EXIT
        int 0x80

the_data_pattern_is_unimportant:
        while % <= 1000
                db 1
        end while
        db 2,3,4
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant

Code:
                                              forward jumps              backward jumps
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
                                              2 passes, 1139 bytes.      2 passes, 1141 bytes.
                                              18.89 user, 0.57 system    32.81 user, 0.01 system
                                              0:19.48 elapsed            0:32.85 elapsed
fasm -d TEST=objectively_better_binary test.asm && time ./test
                                              2 passes, 1137 bytes.      2 passes, 1139 bytes.
                                              18.72 user, 0.75 system    19.26 user, 0.00 system
                                              0:19.50 elapsed            0:19.28 elapsed
fasm -d TEST=four_way_junction test.asm && time ./test
                                              2 passes, 1133 bytes.      2 passes, 1135 bytes.
                                              27.99 user, 0.00 system    28.17 user, 0.55 system
                                              0:28.02 elapsed            0:28.75 elapsed
fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
                                              2 passes, 1133 bytes.      2 passes, 1135 bytes.
                                              23.61 user, 0.78 system    23.52 user, 0.47 system
                                              0:24.41 elapsed            0:24.02 elapsed
fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
                                              2 passes, 1133 bytes.      2 passes, 1135 bytes.
                                              23.45 user, 0.48 system    23.61 user, 0.45 system
                                              0:23.96 elapsed            0:24.07 elapsed
fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
                                              2 passes, 1133 bytes.      2 passes, 1135 bytes.
                                              23.46 user, 0.36 system    23.64 user, 0.32 system
                                              0:23.84 elapsed            0:23.98 elapsed
fasm -d TEST=cmov_for_the_win test.asm && time ./test
                                              2 passes, 1165 bytes.      2 passes, 1167 bytes.
                                              32.76 user, 0.58 system    32.65 user, 0.48 system
                                              0:33.37 elapsed            0:33.16 elapsed
fasm -d TEST=LUTs_are_awesome test.asm && time ./test
                                              2 passes, 1151 bytes.      2 passes, 1151 bytes.
                                              18.86 user, 0.00 system    18.80 user, 0.34 system
                                              0:18.88 elapsed            0:19.16 elapsed
fasm -d TEST=computed_goto test.asm && time ./test
                                              3 passes, 1135 bytes.      2 passes, 1137 bytes.
                                              18.75 user, 0.00 system    18.77 user, 0.28 system
                                              0:18.77 elapsed            0:19.07 elapsed
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
                                              2 passes, 1137 bytes.      2 passes, 1139 bytes.
                                              23.11 user, 0.00 system    37.48 user, 0.56 system
                                              0:23.13 elapsed            0:38.07 elapsed
macomics 22 Aug 2022, 12:00
These are the values your code produces without any changes:
Code:
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
        2 passes, 1139 bytes.   1.89 user, 0.00 system, 0:01.90 elapsed
fasm -d TEST=objectively_better_binary test.asm && time ./test
        2 passes, 1137 bytes.   1.87 user, 0.00 system, 0:01.87 elapsed
fasm -d TEST=four_way_junction test.asm && time ./test
        2 passes, 1133 bytes.   2.79 user, 0.00 system, 0:02.80 elapsed
fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
        2 passes, 1133 bytes.   2.37 user, 0.03 system, 0:02.41 elapsed
fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
        2 passes, 1133 bytes.   2.36 user, 0.00 system, 0:02.36 elapsed
fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
        2 passes, 1133 bytes.   2.36 user, 0.00 system, 0:02.36 elapsed
fasm -d TEST=cmov_for_the_win test.asm && time ./test
        2 passes, 1165 bytes.   3.27 user, 0.00 system, 0:03.27 elapsed
fasm -d TEST=LUTs_are_awesome test.asm && time ./test
        2 passes, 1151 bytes.   1.88 user, 0.00 system, 0:01.88 elapsed
fasm -d TEST=computed_goto test.asm && time ./test
        3 passes, 1135 bytes.   1.87 user, 0.00 system, 0:01.88 elapsed
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
        2 passes, 1137 bytes.   2.30 user, 0.00 system, 0:02.31 elapsed
Furs 22 Aug 2022, 12:54
See? 1.89 vs 1.87 is pretty much the same thing (even slightly in favor of binary search, but whatever), which suggests that the rule of thumb I gave in my very first post in this thread, that about 4 entries is the cutoff point, is actually useful advice. Shocking. But yeah, I'm moving goal posts.

Anyway, when people ask "what is fastest?" they are generally asking for a rule of thumb. Testing is fine, but such tests are usually artificial and depend on the system in question, so unless the difference is drastic the test result is simply not good advice. When people code, they just want to write the code; they ask the question because they want to know how to start. Applying some simple design decisions and optimizations as you write is good enough, without having to test for hours on different systems. If you really care about a particular hot spot, then yes, go test it, but in 99% of cases you apply rules of thumb and "obvious" optimizations. That is far more useful to the people asking this question. Telling them to "test it" is usually the same as not answering at all.

Also, as you said, the data pattern matters. But unless you know the data the user is going to feed in, testing on skewed data is counter-productive for making a decision. You test on skewed data to find the pathological worst case, sure, and in that case binary search will always be better (if there are enough entries, i.e. past the 4-entry cutoff), because a linear chain has a very bad worst case when the data is skewed in the opposite direction. If you don't know the data, don't use specific data as your argument. Imagine someone asking "what should I use, a shift or a multiply by a power of 2?" and your answer being "test it".
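To make the disputed rule of thumb concrete, here is a minimal sketch (not from any post in this thread; the case1..case8 handler labels are placeholders and AL is assumed to hold 1..8) contrasting a linear chain with a binary split once the number of targets passes the claimed 4-entry cutoff:

Code:
use32

; Linear chain: the worst case walks through 7 compare-and-branch pairs.
linear_8way:
        cmp al,1
        je case1
        cmp al,2
        je case2
        cmp al,3
        je case3
        cmp al,4
        je case4
        cmp al,5
        je case5
        cmp al,6
        je case6
        cmp al,7
        je case7
        jmp case8

; Binary split: any value reaches its handler after at most 3 compares.
binary_8way:
        cmp al,4
        ja .hi                  ; 5..8
        cmp al,2
        ja .lo_hi               ; 3..4
        je case2
        jmp case1
.lo_hi: cmp al,3
        je case3
        jmp case4
.hi:    cmp al,6
        ja .hi_hi               ; 7..8
        je case6
        jmp case5
.hi_hi: cmp al,7
        je case7
        jmp case8

; placeholder handlers so the sketch assembles on its own
case1:
case2:
case3:
case4:
case5:
case6:
case7:
case8:
        ret

Whether the break-even point really is 4 entries is exactly what the timings in this thread are probing; the sketch only shows the shape of the two alternatives.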
Fastestcodes 22 Aug 2022, 14:44
ProMiNick wrote:
It works. Thx all.

Code:
section '.data'
jtbl dd La00,La01...Laff

movzx eax,byte[esi]
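Fastestcodes' snippet is only a fragment (the La00...Laff list is elided). A complete 256-way dispatch along the same lines might look like the sketch below; this is a reconstruction, not the poster's actual code, and it generates La0..La255 with decimal suffixes via rept rather than the hex-style names in the post, with a trivial shared handler body:

Code:
format elf executable
entry start

start:
        mov esi,input                   ; ESI points at the byte that selects the handler
        movzx eax,byte [esi]            ; zero-extend so the full 32-bit index is clean
        jmp dword [jtbl+eax*4]          ; one indirect jump, no compare chain

; 256 numbered handlers, generated here as identical stubs
rept 256 n:0 {
    La#n:
        jmp done
}

done:
        mov eax,1                       ; sys_exit
        xor ebx,ebx
        int 0x80

input   db 0x41                         ; any byte value 0..255 is a valid selector

align 4
jtbl:
rept 256 n:0 { dd La#n }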
macomics 22 Aug 2022, 15:00
I also wanted to find out how the above macros behave when all 4 labels are exercised equally in the test.
Code:
                                              forward jumps              backward jumps
fasm -d TEST=uselss_cmp_chain test.asm
                                              2 passes, 1136 bytes.      2 passes, 1138 bytes.
                                              27.65 user, 0.00 system    1:24.48 user, 0.06 system
                                              0:27.67 elapsed            1:24.58 elapsed
fasm -d TEST=objectively_better_binary test.asm
                                              2 passes, 1134 bytes.      2 passes, 1136 bytes.
                                              32.23 user, 0.02 system    2:15.47 user, 0.00 system
                                              0:32.28 elapsed            2:15.55 elapsed
fasm -d TEST=four_way_junction test.asm
                                              2 passes, 1130 bytes.      2 passes, 1132 bytes.
                                              27.72 user, 0.00 system    1:17.49 user, 0.00 system
                                              0:27.74 elapsed            1:17.54 elapsed
fasm -d TEST=hack_1234_chain_1 test.asm
                                              2 passes, 1130 bytes.      2 passes, 1132 bytes.
                                              27.72 user, 0.00 system    1:22.72 user, 0.00 system
                                              0:27.74 elapsed            1:22.77 elapsed
fasm -d TEST=hack_1234_chain_2 test.asm
                                              2 passes, 1130 bytes.      2 passes, 1132 bytes.
                                              27.70 user, 0.01 system    1:22.08 user, 0.89 system
                                              0:27.74 elapsed            1:23.03 elapsed
fasm -d TEST=hack_1234_chain_3 test.asm
                                              2 passes, 1130 bytes.      2 passes, 1132 bytes.
                                              27.71 user, 0.00 system    1:21.56 user, 1.68 system
                                              0:27.73 elapsed            1:23.36 elapsed
fasm -d TEST=cmov_for_the_win test.asm
                                              2 passes, 1162 bytes.      2 passes, 1164 bytes.
                                              32.37 user, 0.00 system    32.33 user, 1.11 system
                                              0:32.40 elapsed            0:33.48 elapsed
fasm -d TEST=LUTs_are_awesome test.asm
                                              2 passes, 1148 bytes.      2 passes, 1148 bytes.
                                              27.65 user, 0.00 system    27.70 user, 0.96 system
                                              0:27.67 elapsed            0:28.70 elapsed
fasm -d TEST=computed_goto test.asm
                                              3 passes, 1132 bytes.      2 passes, 1134 bytes.
                                              27.63 user, 0.01 system    27.68 user, 1.02 system
                                              0:27.66 elapsed            0:28.75 elapsed
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm
                                              2 passes, 1134 bytes.      2 passes, 1136 bytes.
                                              25.34 user, 0.01 system    2:00.77 user, 2.09 system
                                              0:25.37 elapsed            2:02.99 elapsed

left: the rept 0 command list and macro definitions are the same as in revolution's listing above; only the main program changes, now using a uniform data pattern (forward jumps):

Code:
format elf executable

SYS_EXIT = 1

        mov ecx,10000000
outer_loop:
        mov ebx,the_data_pattern_is_unimportant_len
        mov esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST {x}
next_item:
        dec ebx
        jnz inner_loop
        loop outer_loop
        mov eax,SYS_EXIT
        int 0x80

item1:  jmp next_item
item2:  jmp next_item
item3:  jmp next_item
item4:  jmp next_item

the_data_pattern_is_unimportant:
        while % <= 250
                db 1,3,4,2
        end while
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant

right: as above, but with the item1..item4 stubs placed before the loop (backward jumps):

Code:
format elf executable

SYS_EXIT = 1

        jmp skip_jumps

item1:  jmp next_item
item2:  jmp next_item
item3:  jmp next_item
item4:  jmp next_item

skip_jumps:
        mov ecx,10000000
outer_loop:
        mov ebx,the_data_pattern_is_unimportant_len
        mov esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST { x }
next_item:
        dec ebx
        jnz inner_loop
        loop outer_loop
        mov eax,SYS_EXIT
        int 0x80

the_data_pattern_is_unimportant:
        while % <= 250
                db 1,3,4,2
        end while
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant

Total: it's best to use a LUT or to compute the jump address. These handle the various patterns most consistently and do not rely on branch prediction.
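One thing worth spelling out about the computed_goto variant that does so well here: it only works because every itemN handler in the harness is exactly the same size, so the target address is a linear function of the selector. Below is a small sketch of the same idea outside the harness (the handler names, their bodies and the 8-byte stride are arbitrary choices, not anything from the thread), with the stride enforced by explicit padding:

Code:
use32

HANDLER_STRIDE = 8      ; chosen so the lea below can encode the scaling (1, 2, 4 or 8 work directly)

dispatch:                               ; AL = 1..4 selects a handler
        movzx eax,al
        lea eax,[(eax-1)*HANDLER_STRIDE+handler1]
        jmp eax

handler1:
        xor ebx,ebx
        ret
        times HANDLER_STRIDE-($-handler1) db 0x90   ; pad with NOPs up to the stride
handler2:
        mov ebx,1
        ret
        times HANDLER_STRIDE-($-handler2) db 0x90
handler3:
        mov ebx,2
        ret
        times HANDLER_STRIDE-($-handler3) db 0x90
handler4:
        mov ebx,3
        ret

In the harness the stride happens to be the size of a single jmp next_item, which is why item2-item1 can be used directly; as soon as the handlers differ in size, the LUT version is the one that still works.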
revolution 22 Aug 2022, 22:14
macomics wrote: Total: it's best to use a LUT or to compute the jump address. These handle the various patterns most consistently and do not rely on branch prediction.

All the talk from Furs about how it is "obvious" and doesn't need testing is clearly just trolling, as my colleague suggested previously. Thanks go to Furs for highlighting the need to test our code with different paths to make sure we are getting the results we want.
macomics 23 Aug 2022, 00:01
Code:
macro LUTs_are_awesome {
        movzx eax,al
        jmp dword[LUT+(eax-1)*4]
        align 4
    LUT:
        dd item1,item2,item3,item4
}

macro LUTs_are_awesome16 {
        movzx eax,al
        jmp dword[LUT+(eax-1)*4]
        align 16
    LUT:
        dd item1,item2,item3,item4
}

macro LUTs_are_awesome_unaligned {
        movzx eax,al
        jmp dword[LUT+(eax-1)*4]
        align 16
        db 1
    LUT:
        dd item1,item2,item3,item4
}

Code:
2000000 times            pattern 1,3,4,2                                        pattern 1,3,4,2,1,2,3,4
                          forward jumps             backward jumps              forward jumps             backward jumps
;--------------------------------------------------------------------------------------------------------------------------------
fasm -d TEST=LUTs_are_awesome test.asm
                          2 passes, 1148 bytes.     2 passes, 1148 bytes.       2 passes, 1148 bytes.     2 passes, 1148 bytes.
                          5.51 user, 0.15 system    5.53 user, 0.00 system      8.60 user, 0.24 system    7.58 user, 0.00 system
                          0:05.68 elapsed           0:05.54 elapsed             0:08.85 elapsed           0:07.60 elapsed
fasm -d TEST=LUTs_are_awesome16 test.asm
                          2 passes, 1148 bytes.     2 passes, 1156 bytes.       2 passes, 1148 bytes.     2 passes, 1156 bytes.
                          5.53 user, 0.17 system    5.53 user, 0.16 system      8.54 user, 0.21 system    7.61 user, 0.00 system
                          0:05.71 elapsed           0:05.71 elapsed             0:08.78 elapsed           0:07.62 elapsed
fasm -d TEST=LUTs_are_awesome_unaligned test.asm
                          2 passes, 1149 bytes.     2 passes, 1157 bytes.       2 passes, 1149 bytes.     2 passes, 1157 bytes.
                          6.46 user, 0.24 system    5.53 user, 0.22 system      7.82 user, 0.19 system    7.64 user, 0.00 system
                          0:06.71 elapsed           0:05.79 elapsed             0:08.02 elapsed           0:07.65 elapsed
fasm -d TEST=uselss_cmp_chain test.asm
                          2 passes, 1136 bytes.     2 passes, 1138 bytes.       2 passes, 1136 bytes.     2 passes, 1138 bytes.
                          5.53 user, 0.00 system    17.09 user, 0.00 system     9.09 user, 0.00 system    18.35 user, 0.62 system
                          0:05.54 elapsed           0:17.11 elapsed             0:09.11 elapsed           0:19.00 elapsed
fasm -d TEST=objectively_better_binary test.asm
                          2 passes, 1134 bytes.     2 passes, 1136 bytes.       2 passes, 1134 bytes.     2 passes, 1136 bytes.
                          6.45 user, 0.00 system    27.12 user, 0.04 system     6.45 user, 0.00 system    27.51 user, 0.29 system
                          0:06.46 elapsed           0:27.19 elapsed             0:06.46 elapsed           0:27.84 elapsed
fasm -d TEST=four_way_junction test.asm
                          2 passes, 1130 bytes.     2 passes, 1132 bytes.       2 passes, 1130 bytes.     2 passes, 1132 bytes.
                          5.55 user, 0.00 system    16.84 user, 0.00 system     15.58 user, 0.00 system   21.06 user, 0.55 system
                          0:05.56 elapsed           0:16.86 elapsed             0:15.60 elapsed           0:21.66 elapsed
fasm -d TEST=hack_1234_chain_1 test.asm
                          2 passes, 1130 bytes.     2 passes, 1132 bytes.       2 passes, 1130 bytes.     2 passes, 1132 bytes.
                          5.54 user, 0.00 system    16.57 user, 0.00 system     10.77 user, 0.00 system   20.50 user, 0.00 system
                          0:05.55 elapsed           0:16.59 elapsed             0:10.78 elapsed           0:20.52 elapsed
fasm -d TEST=hack_1234_chain_2 test.asm
                          2 passes, 1130 bytes.     2 passes, 1132 bytes.       2 passes, 1130 bytes.     2 passes, 1132 bytes.
                          5.54 user, 0.14 system    16.55 user, 0.01 system     10.77 user, 0.00 system   20.65 user, 0.58 system
                          0:05.70 elapsed           0:16.61 elapsed             0:10.78 elapsed           0:21.26 elapsed
fasm -d TEST=hack_1234_chain_3 test.asm
                          2 passes, 1130 bytes.     2 passes, 1132 bytes.       2 passes, 1130 bytes.     2 passes, 1132 bytes.
                          5.54 user, 0.00 system    16.51 user, 0.05 system     10.75 user, 0.00 system   20.50 user, 0.69 system
                          0:05.55 elapsed           0:16.59 elapsed             0:10.77 elapsed           0:21.21 elapsed
fasm -d TEST=cmov_for_the_win test.asm
                          2 passes, 1162 bytes.     2 passes, 1164 bytes.       2 passes, 1162 bytes.     2 passes, 1164 bytes.
                          6.44 user, 0.22 system    6.45 user, 0.00 system      9.03 user, 0.29 system    9.16 user, 0.27 system
                          0:06.67 elapsed           0:06.46 elapsed             0:09.34 elapsed           0:09.45 elapsed
fasm -d TEST=computed_goto test.asm
                          3 passes, 1132 bytes.     2 passes, 1134 bytes.       3 passes, 1132 bytes.     2 passes, 1134 bytes.
                          5.56 user, 0.14 system    (5.53) user, 0.00 system    5.54 user, 0.01 system    (7.54) user, 0.00 system   ; I checked several times - this is not a random value!
                          0:05.71 elapsed           0:05.54 elapsed             0:05.57 elapsed           0:07.55 elapsed
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm
                          2 passes, 1134 bytes.     2 passes, 1136 bytes.       2 passes, 1134 bytes.     2 passes, 1136 bytes.
                          5.07 user, 0.17 system    24.56 user, 0.68 system     5.07 user, 0.02 system    23.38 user, 0.00 system
                          0:05.25 elapsed           0:25.27 elapsed             0:05.11 elapsed           0:23.41 elapsed

Does anyone know where this delay can come from? (computed_goto)

The LUT timing also changes when the pattern gets longer, but only once the pattern is more than 6 values long. Most likely these are peculiarities of my processor.
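One way to go beyond hand-written patterns of 4, 6 or 8 values is to let fasm generate a longer, irregular pattern at assembly time. A sketch of that idea (the LCG constants and the 1000-byte length are arbitrary illustration values), shaped as a drop-in replacement for the pattern block of the test program:

Code:
; pseudo-random sequence of values 1..4, fixed at assembly time so every run sees the same data
seed = 12345
the_data_pattern_is_unimportant:
repeat 1000
        seed = (seed*1103515245 + 12345) and 0xFFFFFF
        db (seed mod 4) + 1
end repeat
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant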
revolution 23 Aug 2022, 03:40
macomics wrote: Does anyone know where this delay can come from? (computed_goto)

But thankfully you didn't follow Furs' advice and forgo testing, because otherwise we would never have known this was a problem for your system with that code.
donn 23 Aug 2022, 04:57
Haven't read this entire thread, but seeing the good ol' optimization and performance arguments from revolution and furs again.
Noticing a pattern with revolution: you seem to err on the side of "performance is not predictable until you test it", and you give valid arguments. I'm not as familiar with Furs' take, but it seems to be more of the opposite, that testing is less significant. I agree more with this, but am happy to change my stance.

Question for revolution: there are documents such as AMD's "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors". Do these have any value in your opinion? If so, why or why not? One quote from that guide:

Quote: "Many factors affect instruction execution time. For instance, when a source operand must be loaded from a memory location, the time required to read the operand from system memory adds to the execution time."

Which is your point, but they detail these factors and give specific performance suggestions, such as:

Quote: "Avoid branches/jumps in the calculation of values."

Performance is a risk assessment, no? It's 100% true that you can't know the exact performance, but you can predict probabilities based on principles, no? I've been doing a lot of Vulkan and SPIR-V lately and I keep having to learn concepts to get frames to render faster, for example a 30-second clip rendered at 4 seconds per frame instead of taking 24 hours total.

If I jumped into this thread under the wrong pretext, apologies, feel free to ignore.
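As a concrete illustration of the quoted "avoid branches/jumps in the calculation of values" advice, here is a small sketch (not taken from the AMD guide) computing an unsigned minimum with a conditional jump versus branchlessly with cmov (available on P6-class CPUs and later):

Code:
use32

; branchy version: the conditional jump takes part in producing the value
min_with_branch:                        ; EAX = min(EAX,EBX), unsigned
        cmp eax,ebx
        jbe .done
        mov eax,ebx
.done:
        ret

; branchless version: no jump, so nothing for the branch predictor to miss
min_branchless:                         ; EAX = min(EAX,EBX), unsigned
        cmp eax,ebx
        cmova eax,ebx                   ; EAX = EBX when EAX > EBX
        ret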
revolution 23 Aug 2022, 06:23
If you know the entire internal state of the CPU and the entire system configuration, and you control all data flows, then in theory you could predict everything.

But in reality you have no clue about the internal CPU state, and many other things, so any predictions are rendered invalid. For example, what is the state of your L1 cache? The BTB? The read/write buffers? The OoO engine? The other threads? Etc. You can't know any of those things; they change frequently, and they can be different each time you execute the code, giving each call a different state to execute within. The best you can do is estimate based upon the vague principles that AMD and Intel like to advertise as being of more value than they really are. I proved this to myself when I tested the AMD and Intel recommendations: they were good about as often as they were bad.

Besides, what is the problem with testing? Why so much resistance to it? Is everyone simply afraid of being proved wrong? By testing we do something called science: you test your assumptions and hypotheses to determine whether they are real or imagined. Otherwise you just have a religion. X is better because Furs said it is. Don't ever test it, lest we show the folly of such proclamations. Optimisation will be done by fiat!
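For what it's worth, "test it" does not have to mean a full external harness timed with time. A rough in-process sketch (assuming a CPU that has rdtsc and lfence, and ignoring frequency scaling and core migration, which affect the time-based runs in this thread as well):

Code:
use32

; Returns the elapsed TSC ticks for the code placed between the two reads in EDI:ESI.
; lfence is used as a cheap serializer so the measured code cannot drift outside the
; two rdtsc reads. EAX and EDX are clobbered.
measure:
        lfence
        rdtsc
        mov esi,eax                     ; start, low 32 bits
        mov edi,edx                     ; start, high 32 bits

        ; --- code under test goes here ---

        lfence
        rdtsc
        sub eax,esi                     ; delta, low
        sbb edx,edi                     ; delta, high
        mov esi,eax
        mov edi,edx                     ; EDI:ESI = elapsed ticks
        ret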
macomics 23 Aug 2022, 06:41
Most likely the delays come from the branch prediction cache. In principle, if it were not for the 2 conditional jumps in the test loops, I think the predictor would have coped even with patterns of 8 values. But this is still just my guess.

I will support revolution here. Some mechanisms you don't even suspect exist can only be discovered, and their behaviour worked out, in practice; testing is one way to do that. There may be a mechanism that is supposed to speed up branches, yet the program you wrote slows down wildly because that mechanism cannot cope. Only reproducing, through testing, the exact situation that occurs in your program lets you find the miscalculations and errors of the system engineers, and possibly learn how to use these mechanisms properly. In any case, writing any program is in effect testing the hardware in practice. Or, put better: rewriting and improving programs is testing the hardware.

Last edited by macomics on 23 Aug 2022, 06:55; edited 1 time in total
bitRAKE 23 Aug 2022, 06:54
macomics wrote: Does anyone know where this delay can come from? (computed_goto)

A LUT costs the same as a computed goto plus the data access. Does the processor even try to predict a LUT or a computed goto? I think not: the code is either in the cache or not. With Jcc the processor will try to load the cache based on its expectation.

_________________
¯\(°_o)/¯ "languages are not safe - uses can be" Bjarne Stroustrup
revolution 23 Aug 2022, 07:01
bitRAKE wrote: Does the processor even try to predict a LUT or a computed goto?

Last edited by revolution on 23 Aug 2022, 07:12; edited 2 times in total
macomics 23 Aug 2022, 07:10
Here is the full text of the testing program.
I'm testing on a tablet with a 2 GHz Intel Atom.

Code:
macro LUTs_are_awesome {
        movzx eax,al
        jmp dword[LUT+(eax-1)*4]
        align 16
        dd 0,0                  ; the LUT must be located on the border of an L1 cache line
    LUT:
        dd item1,item2,item3,item4
}

ADD: Interestingly, the unaligned table gave better results than the first two macros with alignment. It makes you wonder whether data alignment really affects the overall performance of the ALU that much (by ALU I just mean the general-purpose registers and integer arithmetic).
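Since the experiment hinges on exactly where LUT lands relative to a cache line, the placement can also be verified by the assembler instead of being trusted. A small standalone sketch (assuming a 64-byte L1 line, with zero dwords standing in for item1..item4):

Code:
L1_LINE = 64

        align L1_LINE                   ; put the table at the start of a cache line
LUT:    dd 0,0,0,0                      ; stand-ins for item1,item2,item3,item4

if (LUT-$$) and (L1_LINE-1)
        err                             ; assembly fails if LUT does not start on a line boundary
end if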
Roman 23 Aug 2022, 09:01
This is more powerful:

Code:
macro Luts Dat,args& {
        movzx eax,byte [Dat]
        jmp dword[LUT+(eax-1)*4]
        align 16
        dd 0,0                  ; The LUT must be located on the border of the L1 cache line
    LUT:
        common
        forward
        dd args
}

;in code
        Luts edi,item1,item2
;in code another place
        Luts esi,item3,item4
;in code another place
        Luts CurrentFunction,item5,item6
macomics 23 Aug 2022, 09:12
Haven't you forgotten about the zeros in your macro? I only added them for test purposes:

Code:
        dd 0,0  ; The LUT must be located on the border of the L1 cache line

When & is used to group macro parameters, there is no need to use common and forward:

Code:
    LUT:
        common
        forward
        dd args

Most likely it is also worth making LUT local, so that there is no error from re-declaring a label with this name:

Code:
macro Luts sel*,base*,miss*,args& {
        local LUT
        movzx eax,byte [sel]
        if ~ miss in <DONTCARE>
                if ~ base = 0
                        cmp eax, base
                        jc miss
                end if
                cmp eax, base + LUT#.count
                jae miss
        end if
        jmp dword[LUT+(eax-base)*4]
        align 16
LUT     dd args
LUT#.count = ( $ - LUT ) shr 2
}

;in code
        Luts edi,1,DONTCARE,item1,item2
;in code another place
        Luts esi,3,error,item3,item4
;in code another place
        Luts CurrentFunction,5,error,item5,item6

Last edited by macomics on 23 Aug 2022, 09:47; edited 4 times in total
revolution 23 Aug 2022, 09:22
macomics wrote: ADD: Interestingly, the unaligned table gave better results than the first two macros with alignment. It makes you wonder whether data alignment really affects the overall performance of the ALU that much.

Also, any alignment penalty might be completely masked by other mechanisms taking longer, effectively making it invisible.