flat assembler
Message board for the users of flat assembler.

Index > Main > 256 possible order. What is the fastest code?

Goto page Previous  1, 2, 3, 4, 5, 6  Next
Author
Thread Post new topic Reply to topic
Furs



Joined: 04 Mar 2016
Posts: 1886
Furs
revolution wrote:
My colleague came back to me and she commented:
Quote:
Why does Furs keep moving the goalposts?

Why did Furs ignore the question about when we have no control over the data format, and keep recommending RLE?

Why does Furs say it is objective and then give a subjective value of maybe 4, or 5, or something else?

Why does Furs still insist no testing is required for the "obvious", and then say the difference is "more or less equal" when clearly the 50% difference is enormous? Our boss would be over-the-moon to have efficiency gains like that.

Is Furs even a programmer, or is it all just a troll?
I'm not sure about the last question. Furs?
First post in this thread: https://board.flatassembler.net/topic.php?p=224267#224267 (number of entries)
Second post in this thread: https://board.flatassembler.net/topic.php?p=224292#224292 (skewed data)

I'm moving the goal posts I stated originally?

Tell your colleague that she won the retarded troll award from me.

Anyway, I cannot take your test seriously, either because you screwed something up (e.g. power governor? background tasks?), or because the whole workload is a jmp, obviously skewing something here. There is no way that fall through performs worse than extra jmps.

Or it's alignment or something (in which case yours works by "luck" only better).

Period.
Post 21 Aug 2022, 14:08
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1886
Furs
Speaking of "lucky" alignment and test results, there is a reason we rule them out when thinking about writing code unless it's literally the last step. What if you change a tiny bit part of the code after some refactoring and suddenly it becomes extremely dog slow because of unlucky alignment?

Have fun testing every little function you write though, and everytime other code changes, you reset everything! And even when tested, it then becomes worse due to code refactoring in other areas so now you have to retest it everytime you change any code. What now? Talk about maintainability.

Even worse: you're testing it on your CPU only. Maybe it runs like dog on Joe's CPU instead. Release a million versions of your software and let each customer test it to see which one performs best on their CPU!

As a (professional?) software developer, you write software for a very large number of CPUs out there. That's where you care about performance, anyway, if it's an internal tool to help you then it will likely be written in some script since maintainability and ability to change/adapt it is far more important than raw performance in that case (not "released" software).

And in such case what purpose does such senseless testing provide other than waste time?

Everyone knows bubble sort is a disaster compared to better sorts. This is obvious due to complexity. Obvious. No testing needed. It will be the case on all CPUs.

Sure if there's only 2 items, it will be better than quicksort (or maybe the same). Same way I expect linear cmp chains to be same on around 4 items in real apps on most CPUs. My "cutoff" point.


Your colleague mentioned skewed data and no ability to know it in advance. Then why are you testing it? If it's not the data you expect the user to have, WHAT PURPOSE DOES OPTIMIZING FOR IT DO? In fact, in such conditions you should optimize for worst case complexity and in this case linear cmp chains are simply horrible especially with a large number of items.

You care if John Doe will complain of the app being insanely slow rather than Joe's "skewed" data being unnoticeably slower.

These are questions that are no brainer for anyone who actually writes software, not toys.
Post 21 Aug 2022, 16:00
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18844
Location: In your JS exploiting you and your system
revolution
macomics wrote:
Could you attach all the macros that were tested with a file?
Post your results also to compare how things change when the system is different.
Code:
rept 0 {
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
fasm -d TEST=objectively_better_binary test.asm && time ./test
fasm -d TEST=four_way_junction test.asm && time ./test
fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
fasm -d TEST=cmov_for_the_win test.asm && time ./test
fasm -d TEST=LUTs_are_awesome test.asm && time ./test
fasm -d TEST=computed_goto test.asm && time ./test
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
}

macro uselss_cmp_chain {
        cmp     al,1
        jz      item1
        cmp     al,2
        jz      item2
        cmp     al,3
        jz      item3
        cmp     al,4
        jz      item4
}

macro objectively_better_binary {
        cmp     al,2
        ja      high_half
        jb      item1
        jmp     item2
high_half:
        cmp     al,3
        je      item3
        jmp     item4
}

macro four_way_junction {
        ; assume AL is in range 1-4
        cmp     al,3
        ja      item4
        je      item3
        jpe     item2
        jmp     item1
}

macro hack_1234_chain_1 {
        sub     al,3 ;add al,-3
        jz      item3
        jp      item2
        js      item1
        jmp     item4
}

macro hack_1234_chain_2 {
        add     al,-3
        jz      item3
        jp      item2
        ja      item1
        jmp     item4
}

macro hack_1234_chain_3 {
        sub     al,3
        jz      item3
        jp      item2
        jb      item1
        jmp     item4
}

macro cmov_for_the_win {
        cmp     al,1
        mov     edx,item1
        cmovz   edi,edx
        cmp     al,2
        mov     edx,item2
        cmovz   edi,edx
        cmp     al,3
        mov     edx,item3
        cmovz   edi,edx
        cmp     al,4
        mov     edx,item4
        cmovz   edi,edx
        jmp     edi
}

macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 4
LUT:    dd      item1,item2,item3,item4
}

macro computed_goto {
        movzx   eax,al
        lea     eax,[(eax-1)*(item2-item1)+item1]
        jmp     eax
}

macro uber_no_need_to_test_code_from_Furs {
        cmp     al,3
        jb      @f
        je      item3
item4x: jmp     next_item
@@:     cmp     al,1
        je      item1
item2x: jmp     next_item
}

format elf executable

SYS_EXIT        = 1

        mov     ecx,1000000
outer_loop:
        mov     ebx,the_data_pattern_is_unimportant_len
        mov     esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST {x}
next_item:
        dec     ebx
        jnz     inner_loop
        loop    outer_loop
        mov     eax,SYS_EXIT
        int     0x80

item1:  jmp     next_item
item2:  jmp     next_item
item3:  jmp     next_item
item4:  jmp     next_item

the_data_pattern_is_unimportant:
        while % <= 1000
                db 1
        end while
        db 2,3,4
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    
Post 22 Aug 2022, 09:14
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
Thanks

ADD: left:
Code:
rept 0 {
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
fasm -d TEST=objectively_better_binary test.asm && time ./test
fasm -d TEST=four_way_junction test.asm && time ./test
fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
fasm -d TEST=cmov_for_the_win test.asm && time ./test
fasm -d TEST=LUTs_are_awesome test.asm && time ./test
fasm -d TEST=computed_goto test.asm && time ./test
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
}

macro uselss_cmp_chain {
        cmp     al,1
        jz      item1
        cmp     al,2
        jz      item2
        cmp     al,3
        jz      item3
        cmp     al,4
        jz      item4
}

macro objectively_better_binary {
        cmp     al,2
        ja      high_half
        jb      item1
        jmp     item2
high_half:
        cmp     al,3
        je      item3
        jmp     item4
}

macro four_way_junction {
        ; assume AL is in range 1-4
        cmp     al,3
        ja      item4
        je      item3
        jpe     item2
        jmp     item1
}

macro hack_1234_chain_1 {
        sub     al,3 ;add al,-3
        jz      item3
        jp      item2
        js      item1
        jmp     item4
}

macro hack_1234_chain_2 {
        add     al,-3
        jz      item3
        jp      item2
        ja      item1
        jmp     item4
}

macro hack_1234_chain_3 {
        sub     al,3
        jz      item3
        jp      item2
        jb      item1
        jmp     item4
}

macro cmov_for_the_win {
        cmp     al,1
        mov     edx,item1
        cmovz   edi,edx
        cmp     al,2
        mov     edx,item2
        cmovz   edi,edx
        cmp     al,3
        mov     edx,item3
        cmovz   edi,edx
        cmp     al,4
        mov     edx,item4
        cmovz   edi,edx
        jmp     edi
}

macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 4
LUT:    dd      item1,item2,item3,item4
}

macro computed_goto {
        movzx   eax,al
        lea     eax,[(eax-1)*(item2-item1)+item1]
        jmp     eax
}

macro uber_no_need_to_test_code_from_Furs {
        cmp     al,3
        jb      @f
        je      item3
item4x: jmp     next_item
@@:     cmp     al,1
        je      item1
item2x: jmp     next_item
}

format elf executable

SYS_EXIT        = 1

        mov     ecx,1000000
outer_loop:
        mov     ebx,the_data_pattern_is_unimportant_len
        mov     esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST {x}
next_item:
        dec     ebx
        jnz     inner_loop
        loop    outer_loop
        mov     eax,SYS_EXIT
        int     0x80

item1:  jmp     next_item
item2:  jmp     next_item
item3:  jmp     next_item
item4:  jmp     next_item

the_data_pattern_is_unimportant:
        while % <= 1000
                db 1
        end while
        db 2,3,4
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    

right:
Code:
rept 0 {
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
fasm -d TEST=objectively_better_binary test.asm && time ./test
fasm -d TEST=four_way_junction test.asm && time ./test
fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
fasm -d TEST=cmov_for_the_win test.asm && time ./test
fasm -d TEST=LUTs_are_awesome test.asm && time ./test
fasm -d TEST=computed_goto test.asm && time ./test
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
}

macro uselss_cmp_chain {
        cmp     al,1
        jz      item1
        cmp     al,2
        jz      item2
        cmp     al,3
        jz      item3
        cmp     al,4
        jz      item4
}

macro objectively_better_binary {
        cmp     al,2
        ja      high_half
        jb      item1
        jmp     item2
high_half:
        cmp     al,3
        je      item3
        jmp     item4
}

macro four_way_junction {
        ; assume AL is in range 1-4
        cmp     al,3
        ja      item4
        je      item3
        jpe     item2
        jmp     item1
}

macro hack_1234_chain_1 {
        sub     al,3 ;add al,-3
        jz      item3
        jp      item2
        js      item1
        jmp     item4
}

macro hack_1234_chain_2 {
        add     al,-3
        jz      item3
        jp      item2
        ja      item1
        jmp     item4
}

macro hack_1234_chain_3 {
        sub     al,3
        jz      item3
        jp      item2
        jb      item1
        jmp     item4
}

macro cmov_for_the_win {
        cmp     al,1
        mov     edx,item1
        cmovz   edi,edx
        cmp     al,2
        mov     edx,item2
        cmovz   edi,edx
        cmp     al,3
        mov     edx,item3
        cmovz   edi,edx
        cmp     al,4
        mov     edx,item4
        cmovz   edi,edx
        jmp     edi
}

macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 4
LUT:    dd      item1,item2,item3,item4
}

macro computed_goto {
        movzx   eax,al
        lea     eax,[(eax-1)*(item2-item1)+item1]
        jmp     eax
}

macro uber_no_need_to_test_code_from_Furs {
        cmp     al,3
        jb      @f
        je      item3
item4x: jmp     next_item
@@:     cmp     al,1
        je      item1
item2x: jmp     next_item
}

format elf executable

SYS_EXIT        = 1

        jmp     skip_jumps

item1:  jmp     next_item
item2:  jmp     next_item
item3:  jmp     next_item
item4:  jmp     next_item

skip_jumps:
        mov     ecx,10000000
outer_loop:
        mov     ebx,the_data_pattern_is_unimportant_len
        mov     esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST { x }
next_item:
        dec     ebx
        jnz     inner_loop
        loop    outer_loop
        mov     eax,SYS_EXIT
        int     0x80

the_data_pattern_is_unimportant:
        while % <= 1000
                db 1
        end while
        db 2,3,4
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    

Code:
   forward jumps           backward jumps
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
2 passes, 1139 bytes.   2 passes, 1141 bytes.
  18.89 user              32.81 user
   0.57 system             0.01 system
0:19.48 elapsed         0:32.85 elapsed

fasm -d TEST=objectively_better_binary test.asm && time ./test
2 passes, 1137 bytes.   2 passes, 1139 bytes.
  18.72 user              19.26 user
   0.75 system             0.00 system
0:19.50 elapsed         0:19.28 elapsed

fasm -d TEST=four_way_junction test.asm && time ./test
2 passes, 1133 bytes.   2 passes, 1135 bytes.
  27.99 user              28.17 user
   0.00 system             0.55 system
0:28.02 elapsed         0:28.75 elapsed

fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
2 passes, 1133 bytes.   2 passes, 1135 bytes.
  23.61 user              23.52 user
   0.78 system             0.47 system
0:24.41 elapsed         0:24.02 elapsed

fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
2 passes, 1133 bytes.   2 passes, 1135 bytes.
  23.45 user              23.61 user
   0.48 system             0.45 system
0:23.96 elapsed         0:24.07 elapsed

fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
2 passes, 1133 bytes.   2 passes, 1135 bytes.
  23.46 user              23.64 user
   0.36 system             0.32 system
0:23.84 elapsed         0:23.98 elapsed

fasm -d TEST=cmov_for_the_win test.asm && time ./test
2 passes, 1165 bytes.   2 passes, 1167 bytes.
  32.76 user              32.65 user
   0.58 system             0.48 system
0:33.37 elapsed         0:33.16 elapsed

fasm -d TEST=LUTs_are_awesome test.asm && time ./test
2 passes, 1151 bytes.   2 passes, 1151 bytes.
  18.86 user              18.80 user
   0.00 system             0.34 system
0:18.88 elapsed         0:19.16 elapsed

fasm -d TEST=computed_goto test.asm && time ./test
3 passes, 1135 bytes.   2 passes, 1137 bytes.
  18.75 user              18.77 user
   0.00 system             0.28 system
0:18.77 elapsed         0:19.07 elapsed

fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
2 passes, 1137 bytes.   2 passes, 1139 bytes.
  23.11 user              37.48 user
   0.00 system             0.56 system
0:23.13 elapsed         0:38.07 elapsed    
I have increased the repetition counter 10 times
Post 22 Aug 2022, 10:43
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
These values are given out by your code without changes
Code:
fasm -d TEST=uselss_cmp_chain test.asm && time ./test                           
2 passes, 1139 bytes.                                                           
   1.89 user
   0.00 system
0:01.90 elapsed


fasm -d TEST=objectively_better_binary test.asm && time ./test
2 passes, 1137 bytes.
   1.87 user
   0.00 system
0:01.87 elapsed


fasm -d TEST=four_way_junction test.asm && time ./test                          
2 passes, 1133 bytes.                                                           
   2.79 user
   0.00 system
0:02.80 elapsed


fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
2 passes, 1133 bytes.
   2.37 user
   0.03 system
0:02.41 elapsed


fasm -d TEST=hack_1234_chain_2 test.asm && time ./test                          
2 passes, 1133 bytes.                                                           
   2.36 user
   0.00 system
0:02.36 elapsed


fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
2 passes, 1133 bytes.
   2.36 user
   0.00 system
0:02.36 elapsed


fasm -d TEST=cmov_for_the_win test.asm && time ./test                           
2 passes, 1165 bytes.                                                           
   3.27 user
   0.00 system
0:03.27 elapsed


fasm -d TEST=LUTs_are_awesome test.asm && time ./test
2 passes, 1151 bytes.
   1.88 user
   0.00 system
0:01.88 elapsed


fasm -d TEST=computed_goto test.asm && time ./test                              
3 passes, 1135 bytes.                                                           
   1.87 user
   0.00 system
0:01.88 elapsed


fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
2 passes, 1137 bytes.
   2.30 user
   0.00 system
0:02.31 elapsed    
Post 22 Aug 2022, 12:00
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1886
Furs
See? 1.89 vs 1.87 is pretty much the same thing (even in favor of binary search but w/e), suggesting the rule of thumb of 4 entries being the cutoff point that I posted originally in my very first post in this thread, as advice, seems to be actually useful. Shocking. But yeah I'm moving goal posts.

Anyway, when people ask for "what is fastest" they generally ask for advice on rule of thumbs. Testing it is fine, but since usually such tests are artificial and depend on the system in question, unless the difference is drastic, it's simply not good advice.

When people code, they just want to write the code. They ask it because they just want to know how to start it. And applying some simple design decisions/optimizations as you write is good enough, without having to test it for hours on different systems.

If you really care about a particular hot spot, yeah, go test it, but in 99% of cases, apply rule of thumbs and "obvious" optimizations. That's far more useful for people who ask this question, as advice. Telling them to "test" is usually the same as not answering at all. Wink


Also as you said, data pattern matters. But unless you know the data the user is going to use, testing on skewed data is anti-productive in regards to making a decision. You test on skewed data, sure, to find the pathological worst case. And in that case, binary search will always be better (if entries are large enough, i.e. the 4 cutoff point) because linear search has very bad worst case when data is skewed in opposite direction.

You don't know the data, don't use specific data as point.


Imagine someone asking "what should i use, shift or mul by power of 2?" and your answer is "test it". Rolling Eyes
Post 22 Aug 2022, 12:54
View user's profile Send private message Reply with quote
Fastestcodes



Joined: 13 Jun 2022
Posts: 74
Fastestcodes
ProMiNick wrote:
Code:
cmp     eax, 255  ;movzx eax,al
ja      notjmptbl ;
jmp     dword[jmptbl+eax*4]; here eax is always <256, no matter filtered eax by compare or zeroed high part out of al
jmptbl:
        dd La00
        dd La01
        ...
        dd LaFF
notjmptbl:    

It works. Thx all.
section'.data'
jtbl dd La00,La01...Laff


movzx eax,byte[esi]
Post 22 Aug 2022, 14:44
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
I also wanted to find out how the above macros work if all 4 labels are involved in the tests equally.
Code:
    forward jumps             backward jumps
fasm -d TEST=uselss_cmp_chain test.asm
2 passes, 1136 bytes.   2 passes, 1138 bytes.
  27.65 user            1:24.48 user
   0.00 system             0.06 system
0:27.67 elapsed         1:24.58 elapsed


fasm -d TEST=objectively_better_binary test.asm
2 passes, 1134 bytes.   2 passes, 1136 bytes.
  32.23 user            2:15.47 user
   0.02 system             0.00 system
0:32.28 elapsed         2:15.55 elapsed


fasm -d TEST=four_way_junction test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.
  27.72 user            1:17.49 user
   0.00 system             0.00 system
0:27.74 elapsed         1:17.54 elapsed


fasm -d TEST=hack_1234_chain_1 test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.
  27.72 user            1:22.72 user
   0.00 system             0.00 system
0:27.74 elapsed         1:22.77 elapsed


fasm -d TEST=hack_1234_chain_2 test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.
  27.70 user            1:22.08 user
   0.01 system             0.89 system
0:27.74 elapsed         1:23.03 elapsed


fasm -d TEST=hack_1234_chain_3 test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.
  27.71 user            1:21.56 user
   0.00 system             1.68 system
0:27.73 elapsed         1:23.36 elapsed


fasm -d TEST=cmov_for_the_win test.asm
2 passes, 1162 bytes.   2 passes, 1164 bytes.
  32.37 user              32.33 user
   0.00 system             1.11 system
0:32.40 elapsed         0:33.48 elapsed


fasm -d TEST=LUTs_are_awesome test.asm
2 passes, 1148 bytes.   2 passes, 1148 bytes.
  27.65 user              27.70 user
   0.00 system             0.96 system
0:27.67 elapsed         0:28.70 elapsed


fasm -d TEST=computed_goto test.asm
3 passes, 1132 bytes.   2 passes, 1134 bytes.
  27.63 user              27.68 user
   0.01 system             1.02 system
0:27.66 elapsed         0:28.75 elapsed


fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm
2 passes, 1134 bytes.   2 passes, 1136 bytes.
  25.34 user            2:00.77 user
   0.01 system             2.09 system
0:25.37 elapsed         2:02.99 elapsed    

left:
Code:
rept 0 {
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
fasm -d TEST=objectively_better_binary test.asm && time ./test
fasm -d TEST=four_way_junction test.asm && time ./test
fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
fasm -d TEST=cmov_for_the_win test.asm && time ./test
fasm -d TEST=LUTs_are_awesome test.asm && time ./test
fasm -d TEST=computed_goto test.asm && time ./test
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
}

macro uselss_cmp_chain {
        cmp     al,1
        jz      item1
        cmp     al,2
        jz      item2
        cmp     al,3
        jz      item3
        cmp     al,4
        jz      item4
}

macro objectively_better_binary {
        cmp     al,2
        ja      high_half
        jb      item1
        jmp     item2
high_half:
        cmp     al,3
        je      item3
        jmp     item4
}

macro four_way_junction {
        ; assume AL is in range 1-4
        cmp     al,3
        ja      item4
        je      item3
        jpe     item2
        jmp     item1
}

macro hack_1234_chain_1 {
        sub     al,3 ;add al,-3
        jz      item3
        jp      item2
        js      item1
        jmp     item4
}

macro hack_1234_chain_2 {
        add     al,-3
        jz      item3
        jp      item2
        ja      item1
        jmp     item4
}

macro hack_1234_chain_3 {
        sub     al,3
        jz      item3
        jp      item2
        jb      item1
        jmp     item4
}

macro cmov_for_the_win {
        cmp     al,1
        mov     edx,item1
        cmovz   edi,edx
        cmp     al,2
        mov     edx,item2
        cmovz   edi,edx
        cmp     al,3
        mov     edx,item3
        cmovz   edi,edx
        cmp     al,4
        mov     edx,item4
        cmovz   edi,edx
        jmp     edi
}

macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 4
LUT:    dd      item1,item2,item3,item4
}

macro computed_goto {
        movzx   eax,al
        lea     eax,[(eax-1)*(item2-item1)+item1]
        jmp     eax
}

macro uber_no_need_to_test_code_from_Furs {
        cmp     al,3
        jb      @f
        je      item3
item4x: jmp     next_item
@@:     cmp     al,1
        je      item1
item2x: jmp     next_item
}

format elf executable

SYS_EXIT        = 1

        mov     ecx,10000000
outer_loop:
        mov     ebx,the_data_pattern_is_unimportant_len
        mov     esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST {x}
next_item:
        dec     ebx
        jnz     inner_loop
        loop    outer_loop
        mov     eax,SYS_EXIT
        int     0x80

item1:  jmp     next_item
item2:  jmp     next_item
item3:  jmp     next_item
item4:  jmp     next_item

the_data_pattern_is_unimportant:
        while % <= 250
                db 1,3,4,2
        end while
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    
right:
Code:
rept 0 {
fasm -d TEST=uselss_cmp_chain test.asm && time ./test
fasm -d TEST=objectively_better_binary test.asm && time ./test
fasm -d TEST=four_way_junction test.asm && time ./test
fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
fasm -d TEST=cmov_for_the_win test.asm && time ./test
fasm -d TEST=LUTs_are_awesome test.asm && time ./test
fasm -d TEST=computed_goto test.asm && time ./test
fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
}

macro uselss_cmp_chain {
        cmp     al,1
        jz      item1
        cmp     al,2
        jz      item2
        cmp     al,3
        jz      item3
        cmp     al,4
        jz      item4
}

macro objectively_better_binary {
        cmp     al,2
        ja      high_half
        jb      item1
        jmp     item2
high_half:
        cmp     al,3
        je      item3
        jmp     item4
}

macro four_way_junction {
        ; assume AL is in range 1-4
        cmp     al,3
        ja      item4
        je      item3
        jpe     item2
        jmp     item1
}

macro hack_1234_chain_1 {
        sub     al,3 ;add al,-3
        jz      item3
        jp      item2
        js      item1
        jmp     item4
}

macro hack_1234_chain_2 {
        add     al,-3
        jz      item3
        jp      item2
        ja      item1
        jmp     item4
}

macro hack_1234_chain_3 {
        sub     al,3
        jz      item3
        jp      item2
        jb      item1
        jmp     item4
}

macro cmov_for_the_win {
        cmp     al,1
        mov     edx,item1
        cmovz   edi,edx
        cmp     al,2
        mov     edx,item2
        cmovz   edi,edx
        cmp     al,3
        mov     edx,item3
        cmovz   edi,edx
        cmp     al,4
        mov     edx,item4
        cmovz   edi,edx
        jmp     edi
}

macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 4
LUT:    dd      item1,item2,item3,item4
}

macro computed_goto {
        movzx   eax,al
        lea     eax,[(eax-1)*(item2-item1)+item1]
        jmp     eax
}

macro uber_no_need_to_test_code_from_Furs {
        cmp     al,3
        jb      @f
        je      item3
item4x: jmp     next_item
@@:     cmp     al,1
        je      item1
item2x: jmp     next_item
}

format elf executable

SYS_EXIT        = 1

        jmp     skip_jumps

item1:  jmp     next_item
item2:  jmp     next_item
item3:  jmp     next_item
item4:  jmp     next_item

skip_jumps:
        mov     ecx,10000000
outer_loop:
        mov     ebx,the_data_pattern_is_unimportant_len
        mov     esi,the_data_pattern_is_unimportant
inner_loop:
        lodsb
        match x,TEST { x }
next_item:
        dec     ebx
        jnz     inner_loop
        loop    outer_loop
        mov     eax,SYS_EXIT
        int     0x80

the_data_pattern_is_unimportant:
        while % <= 250
                db 1,3,4,2
        end while
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    


Total: It's best to LUT or calculate the jump address. They hold various patterns the most stable and do not stray to predict transitions.
Post 22 Aug 2022, 15:00
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18844
Location: In your JS exploiting you and your system
revolution
macomics wrote:
Total: It's best to LUT or calculate the jump address. They hold various patterns the most stable and do not stray to predict transitions.
Which you could not have known if you didn't test it.

All the talk from Furs about how it is "obvious" and doesn't need testing it clearly just a troll, like my colleague suggested previously. Thanks go to Furs for highlighting the need to test our code with different paths to make sure we are getting the results we want.
Post 22 Aug 2022, 22:14
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
Code:
macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 4
LUT:    dd      item1,item2,item3,item4
}

macro LUTs_are_awesome16 {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 16
LUT:    dd      item1,item2,item3,item4
}

macro LUTs_are_awesome_unaligned {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 16
        db      1
LUT:    dd      item1,item2,item3,item4
}    


Code:
                                        2000000 times
                pattern 1,3,4,2                             pattern 1,3,4,2,1,2,3,4
    forward jumps           backward jumps          forward jumps           backward jumps
;--------------------------------------------------------------------------------------------
fasm -d TEST=LUTs_are_awesome test.asm
2 passes, 1148 bytes.   2 passes, 1148 bytes.   2 passes, 1148 bytes.   2 passes, 1148 bytes.
   5.51 user               5.53 user               8.60 user               7.58 user
   0.15 system             0.00 system             0.24 system             0.00 system
0:05.68 elapsed         0:05.54 elapsed         0:08.85 elapsed         0:07.60 elapsed


fasm -d TEST=LUTs_are_awesome16 test.asm
2 passes, 1148 bytes.   2 passes, 1156 bytes.   2 passes, 1148 bytes.   2 passes, 1156 bytes.
   5.53 user               5.53 user               8.54 user               7.61 user
   0.17 system             0.16 system             0.21 system             0.00 system
0:05.71 elapsed         0:05.71 elapsed         0:08.78 elapsed         0:07.62 elapsed


fasm -d TEST=LUTs_are_awesome_unaligned test.asm
2 passes, 1149 bytes.   2 passes, 1157 bytes.   2 passes, 1149 bytes.   2 passes, 1157 bytes.
   6.46 user               5.53 user               7.82 user               7.64 user
   0.24 system             0.22 system             0.19 system             0.00 system
0:06.71 elapsed         0:05.79 elapsed         0:08.02 elapsed         0:07.65 elapsed


fasm -d TEST=uselss_cmp_chain test.asm
2 passes, 1136 bytes.   2 passes, 1138 bytes.   2 passes, 1136 bytes.   2 passes, 1138 bytes.
   5.53 user              17.09 user               9.09 user              18.35 user
   0.00 system             0.00 system             0.00 system             0.62 system
0:05.54 elapsed         0:17.11 elapsed         0:09.11 elapsed         0:19.00 elapsed


fasm -d TEST=objectively_better_binary test.asm
2 passes, 1134 bytes.   2 passes, 1136 bytes.   2 passes, 1134 bytes.   2 passes, 1136 bytes.
   6.45 user              27.12 user               6.45 user              27.51 user
   0.00 system             0.04 system             0.00 system             0.29 system
0:06.46 elapsed         0:27.19 elapsed         0:06.46 elapsed         0:27.84 elapsed


fasm -d TEST=four_way_junction test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.   2 passes, 1130 bytes.   2 passes, 1132 bytes.
   5.55 user              16.84 user              15.58 user              21.06 user
   0.00 system             0.00 system             0.00 system             0.55 system
0:05.56 elapsed         0:16.86 elapsed         0:15.60 elapsed         0:21.66 elapsed


fasm -d TEST=hack_1234_chain_1 test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.   2 passes, 1130 bytes.   2 passes, 1132 bytes.
   5.54 user              16.57 user              10.77 user              20.50 user
   0.00 system             0.00 system             0.00 system             0.00 system
0:05.55 elapsed         0:16.59 elapsed         0:10.78 elapsed         0:20.52 elapsed


fasm -d TEST=hack_1234_chain_2 test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.   2 passes, 1130 bytes.   2 passes, 1132 bytes.
   5.54 user              16.55 user              10.77 user              20.65 user
   0.14 system             0.01 system             0.00 system             0.58 system
0:05.70 elapsed         0:16.61 elapsed         0:10.78 elapsed         0:21.26 elapsed


fasm -d TEST=hack_1234_chain_3 test.asm
2 passes, 1130 bytes.   2 passes, 1132 bytes.   2 passes, 1130 bytes.   2 passes, 1132 bytes.
   5.54 user              16.51 user              10.75 user              20.50 user
   0.00 system             0.05 system             0.00 system             0.69 system
0:05.55 elapsed         0:16.59 elapsed         0:10.77 elapsed         0:21.21 elapsed


fasm -d TEST=cmov_for_the_win test.asm
2 passes, 1162 bytes.   2 passes, 1164 bytes.   2 passes, 1162 bytes.   2 passes, 1164 bytes.
   6.44 user               6.45 user               9.03 user               9.16 user
   0.22 system             0.00 system             0.29 system             0.27 system
0:06.67 elapsed         0:06.46 elapsed         0:09.34 elapsed         0:09.45 elapsed


fasm -d TEST=computed_goto test.asm
3 passes, 1132 bytes.   2 passes, 1134 bytes.   3 passes, 1132 bytes.   2 passes, 1134 bytes.
   5.56 user               (5.53) user             5.54 user               (7.54) user ; I checked several times - this is not a random value!
   0.14 system             0.00 system             0.01 system             0.00 system
0:05.71 elapsed         0:05.54 elapsed         0:05.57 elapsed         0:07.55 elapsed


fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm
2 passes, 1134 bytes.   2 passes, 1136 bytes.   2 passes, 1134 bytes.   2 passes, 1136 bytes.
   5.07 user              24.56 user               5.07 user              23.38 user
   0.17 system             0.68 system             0.02 system             0.00 system
0:05.25 elapsed         0:25.27 elapsed         0:05.11 elapsed         0:23.41 elapsed    
Does anyone know where this delay can come from (computed_goto)? Perhaps there is an error in the formulas when calculating.

LUT also changes the testing time when the template length increases. But this happens when the template length is more than 6 values. Most likely, these are the features of my processor.
Post 23 Aug 2022, 00:01
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18844
Location: In your JS exploiting you and your system
revolution
macomics wrote:
Does anyone know where this delay can come from (computed_goto)?
Without detailed knowledge of the internals of your CPU it is difficult to know. We could try to guess some sort of buffer or prediction mechanism has exceeded it's limit.

But thankfully you didn't follow Furs' advice and forgo testing, because otherwise we would never have known this was a problem for your system with that code.
Post 23 Aug 2022, 03:40
View user's profile Send private message Visit poster's website Reply with quote
donn



Joined: 05 Mar 2010
Posts: 230
donn
Haven't read this entire thread, but seeing the good ol' optimization and performance arguments from revolution and furs again.

Noticing a pattern with revolution: you seem to err on the side of 'performance is not predictable until you test it.' And you give valid arguments. Not as familiar with furs' take, but seems to be more of the opposite, that testing is less significant. I agree more with this, but am happy to change my stance.

Question for revolution: There are articles on AMD Manuals such as "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors". Do these articles have any value in your opinion, if so why or why not?

One quote from article:
Quote:
"Many factors affect instruction execution time. For instance, when a source operand must be loaded from a memory location, the time required to read the operand from system memory adds to the execution time."


Which is your point, but they detail these factors and give specific performance suggestions, such as:

Quote:
"Avoid branches/jumps in the calculation of values."

Performance is a risk assessment, no? It's 100% true that you can't know the exact performance, but you can predict probabilities based on principles, no? I've been doing a lot of Vulkan and SPIR-V lately and I keep having to learn concepts to get frame rates to render faster, like a 30 second clip rendered at 4 seconds per frame instead of it taking 24 hours total.

If I jumped into this thread under the wrong pretext, apologies, feel free to ignore.
Post 23 Aug 2022, 04:57
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18844
Location: In your JS exploiting you and your system
revolution
If you know the entire CPU internal state and the entire system configuration, and control all data flows then in theory you could predict everything.

But in reality you have no clue about the internal CPU state, and many other things, so any predictions are rendered invalid.

For example, what is the state of your L1 cache? The BTB? The read/write buffers? the OoO engine? The other threads? etc. etc. etc. You can't know any of those things, they change frequently. They can be different each time you execute the code giving each call a different state to execute within.

The best you can do is estimate based upon the vague principles that AMD and Intel like to advertise as being of more value than they really are. I proved this to myself when I tested the AMD and Intel recommendations they were good as often as they were bad.

Besides, what is the problem with testing? Why so much resistance to it? Is everyone simply afraid to be proved wrong? By testing we do something called science. You test your assumptions and hypotheses to determine if they are real or imagined. Otherwise you just have a religion. X is better because Furs said it is. Don't ever test it lest we show the folly of such proclamations. Laughing Optimisation will be done by fiat! SmileSmile
Post 23 Aug 2022, 06:23
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
Most likely, the delays appear due to the jump prediction cache. In principle, if it were not for 2 conditional transitions in testing cycles, then, I think, even with templates of 8 values, the system would have coped. But this is still my guess.

I will support revolution. Moreover, the work of some systems that you do not suspect can be calculated and found out about their existence only in practice. Testing is one of the approaches. There is a system that speeds up the execution of branches, but it turns out that the program you wrote is wildly slowing down and the system is not coping. Only reproducing in practice as a result of testing the situation that occurs in your program will allow you to find out the miscalculations and errors of system engineers, and, possibly, the features of using these mechanisms.

In any case, writing any program is no different from testing hardware in practice. Although it's probably better to say this - rewriting and improving the work of programs is testing hardware.


Last edited by macomics on 23 Aug 2022, 06:55; edited 1 time in total
Post 23 Aug 2022, 06:41
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3483
Location: vpcmipstrm
bitRAKE
macomics wrote:
Does anyone know where this delay can come from (computed_goto)?
Maybe I missed what processors test is being executed on?

LUT is cost of computed goto plus data cost. Does processor even try to predicted LUT or computed goto? I think not - code is either in cache or not. With Jcc processor will try to load cache based on expectation.

_________________
¯\(°_o)/¯ unlicense.org
Post 23 Aug 2022, 06:54
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18844
Location: In your JS exploiting you and your system
revolution
bitRAKE wrote:
Does processor even try to predicted LUT or computed goto?
I think all current CPUs have a BTB that attempts to predict jmps. Perhaps there are CPUs that don't have a BTB? It's been a while since I bothered to read CPU specs. I found them mostly only useful for marketing and hype. The real proof is in the running of the code to see if such things are effective.


Last edited by revolution on 23 Aug 2022, 07:12; edited 2 times in total
Post 23 Aug 2022, 07:01
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
Here is the full text of the testing program

I'm testing on a tablet with Intel Atom 2 GHz

Code:
macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 16
        dd 0,0 ; The LUT must be located on the border of the L1 cache line
LUT:    dd      item1,item2,item3,item4
}    
I was interested in the problem of the location of the LUT relative to the L1 cache. That's why I added 3 variants of this macro. However, most likely this testing should be carried out on programs with a large number of branches.

ADD: Interestingly, an unaligned table gave better results than the first two macros with alignment. Makes you wonder if data alignment affects the overall performance of the ALU so much. By ALU, I just mean general-purpose registers and integer mathematics.
Post 23 Aug 2022, 07:10
View user's profile Send private message Reply with quote
Roman



Joined: 21 Apr 2012
Posts: 1251
Roman
more powerful this
Code:
macro Luts Dat,args& {
        movzx   eax,byte [Dat]
        jmp     dword[LUT+(eax-1)*4]
        align 16
        dd 0,0 ; The LUT must be located on the border of the L1 cache line
LUT:    
common
forward
dd      args
}  

;in code
Luts edi,item1,item2
;in code another place
Luts esi,item3,item4
;in code another place
Luts CurrentFunction,item5,item6
    
Post 23 Aug 2022, 09:01
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
Have you forgotten the zeros in your macro that I added for test purposes?
Code:
        dd 0,0 ; The LUT must be located on the border of the L1 cache line    


When using & to group macro parameters, there is no need to use common and forward.
Code:
LUT:    
common
forward
dd      args    


Most likely, it is worth adding a local LUT so that there is no error of re-declaring a label with this name.

Code:
macro Luts sel*,base*,miss*,args& { local LUT
        movzx   eax,byte [sel]
  if ~ miss in <DONTCARE>
    if ~ base = 0
        cmp     eax, base
        jc      miss
    end if
        cmp     eax, base + LUT#.count
        jae     miss
  end if
        jmp     dword[LUT+(eax-base)*4]
align 16
LUT     dd      args
LUT#.count = ( $ - LUT ) shr 2 }  

;in code
Luts edi,1,DONTCARE,item1,item2
;in code another place
Luts esi,3,error,item3,item4
;in code another place
Luts CurrentFunction,5,error,item5,item6    


Last edited by macomics on 23 Aug 2022, 09:47; edited 4 times in total
Post 23 Aug 2022, 09:12
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18844
Location: In your JS exploiting you and your system
revolution
macomics wrote:
ADD: Interestingly, an unaligned table gave better results than the first two macros with alignment. Makes you wonder if data alignment affects the overall performance of the ALU so much. By ALU, I just mean general-purpose registers and integer mathematics.
The ALU is not involved with alignment in any way. It is the data load that has to re-align the data before it gets delivered to the ALU.

Also any alignment penalty might be completely masked by other mechanisms taking longer and effectively making it invisible.
Post 23 Aug 2022, 09:22
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5, 6  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.