flat assembler
Message board for the users of flat assembler.

Index > Main > 256 possible order. What is the fastest code?

Goto page Previous  1, 2, 3, 4, 5, 6  Next
Author
Thread Post new topic Reply to topic
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
None of those are binary searches. Furs claimed that binary search is objectively better, we need to test that claim to show how "obvious" it is.

Anyhow, the point is well proved, that the data pattern makes all the difference. You need to optimise the code to suit the data patterns encountered in normal operation. Different data patterns need different code.
Post 18 Aug 2022, 17:54
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3464
Location: vpcmipstrm
bitRAKE
LAHF and we get the full branch potential. Very Happy
(Or a single bit shift.)

_________________
¯\(°_o)/¯ unlicense.org


Last edited by bitRAKE on 19 Aug 2022, 04:29; edited 1 time in total
Post 19 Aug 2022, 04:11
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
I got these four new variants tested on the same box as before.
Code:
the_data_pattern_is_unimportant:
        while % <= 1000
                db 1
        end while
        db 2,3,4
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    
Code:
~ fasm -d TEST=four_way_junction test.asm && time ./test 
flat assembler  version 1.73.08  (4042780 kilobytes memory)
2 passes, 1133 bytes.

real    0m7.084s
user    0m6.976s
sys     0m0.016s
~ fasm -d TEST=hack_1234_chain_1 test.asm && time ./test 
flat assembler  version 1.73.08  (4042332 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.836s
user    0m6.556s
sys     0m0.004s
~ fasm -d TEST=hack_1234_chain_2 test.asm && time ./test 
flat assembler  version 1.73.08  (4032536 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.561s
user    0m6.412s
sys     0m0.004s
~ fasm -d TEST=hack_1234_chain_3 test.asm && time ./test 
flat assembler  version 1.73.08  (4048760 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.483s
user    0m6.408s
sys     0m0.000s    
Different data
Code:
the_data_pattern_is_unimportant:
        while % <= 1000
                db 4
        end while
        db 2,3,1
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    
Code:
~ fasm -d TEST=four_way_junction test.asm && time ./test 
flat assembler  version 1.73.08  (4032044 kilobytes memory)
2 passes, 1133 bytes.

real    0m5.167s
user    0m5.092s
sys     0m0.004s
~ fasm -d TEST=hack_1234_chain_1 test.asm && time ./test 
flat assembler  version 1.73.08  (4043256 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.823s
user    0m6.776s
sys     0m0.000s
~ fasm -d TEST=hack_1234_chain_2 test.asm && time ./test 
flat assembler  version 1.73.08  (4026552 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.892s
user    0m6.768s
sys     0m0.016s
~ fasm -d TEST=hack_1234_chain_3 test.asm && time ./test 
flat assembler  version 1.73.08  (4042588 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.873s
user    0m6.784s
sys     0m0.000s    
None of those beat the "useless_cmp_chain". And again the times change depending upon the data pattern.
Post 19 Aug 2022, 04:15
View user's profile Send private message Visit poster's website Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8026
Location: Kraków, Poland
Tomasz Grysztar
revolution wrote:
None of those beat the "useless_cmp_chain". And again the times change depending upon the data pattern.
As I said, at least my post was not supposed to add to the topic, it was just a bit of fun to make the thread less dense. Yes, multi-way junctions are fun! Wink
Post 19 Aug 2022, 07:39
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
I do hope that everyone realises that these synthetic tests are only valid for the actual code shown, on the actual system it ran on, with the actual data used. That is, they shouldn't be used to make decisions on some other code, on some other system, on different data.

It would be instructive to have results from the same code and data above, but on different systems. I expect results to be all over the place, with "faster" code becoming "slower", on some other systems.
Post 19 Aug 2022, 07:48
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
I notice that the "fastest" code is also the largest.

Perhaps we should optimise for anti-size, and then get the fastest. Laughing
Post 19 Aug 2022, 08:15
View user's profile Send private message Visit poster's website Reply with quote
FlierMate1



Joined: 31 May 2022
Posts: 118
FlierMate1
Tomasz Grysztar wrote:
Side-stepping the topic a little, this could be made into a fun 4-way junction:
Code:
; assume AL is in range 1-4
        cmp     al,3
        ja      item4
        je      item3
        jpe     item2
        jmp     item1    


Nice!!!

JA=Jump if above 3, i.e. 4
JE=Jump if equals 3
JPE=Jump if parity even, in this case, 2
JMP=Jump, when value is 1
Post 19 Aug 2022, 08:34
View user's profile Send private message Reply with quote
Tomasz Grysztar



Joined: 16 Jun 2003
Posts: 8026
Location: Kraków, Poland
Tomasz Grysztar
FlierMate1 wrote:
JPE=Jump if parity even, in this case, 2
This deserves a more careful explanation, to avoid it being misunderstood as jump taken because 2 is even. PF does not say that, it counts the set bits among the lowest 8 bits of the result, and reflects whether that number is even or not. In this case the data is 8-bit, so it counts 1s in the entirety of the result.

CMP is the same as SUB, except it does not update the destination register with the result, only flags. If we subtract 3 from 2, we get -1, which is 11111111b and has an even number of 1s. This allows to differentiate it from 1 - 3 = -2, which is 11111110b and has an odd number of 1s.

revolution wrote:
I do hope that everyone realises that these synthetic tests are only valid for the actual code shown, on the actual system it ran on, with the actual data used. That is, they shouldn't be used to make decisions on some other code, on some other system, on different data.
I did hope that you would show up in these threads, you have the most patience to explain this point.
Post 19 Aug 2022, 08:58
View user's profile Send private message Visit poster's website Reply with quote
Overclick



Joined: 11 Jul 2020
Posts: 575
Location: Ukraine
Overclick
Code:
jmp [adritem0+rax*8]
adritem0 dq item0
   dq item1
   dq item2
   dq item3
   dq item4
   dq item_n
...
item0:
...
item1:
...
item2:
...
item3:
...
item4:
...
    

Code:
shl rax,5
add rax,item0
jmp rax

align 32
item0:
...
align 32
item1:
...
align 32
item2:
...
align 32
item3:
...
align 32
item4:
...    
Post 19 Aug 2022, 09:40
View user's profile Send private message Visit poster's website Reply with quote
FlierMate1



Joined: 31 May 2022
Posts: 118
FlierMate1
I test C# switch statement with Compiler Explorer website, and I found that .NET 6.0.101 uses lookup table (LUT?) to evaluate the value.

Code:
        grade=Console.ReadLine()[0];
       
        switch (grade) {
           case 'A':
              Console.WriteLine("Excellent!");
              break;
           case 'B':
           case 'C':
              Console.WriteLine("Well done");
              break;
           case 'D':
              Console.WriteLine("You passed");
              break;
           case 'F':
              Console.WriteLine("Better try again");
              break;
          default:
           Console.WriteLine("Invalid grade");
              break;
        }    


The HLL code above is translated to the following:

Code:
G_M21826_IG02:
...
...
       mov      edi, edi
       lea      rax, [reloc @RWD00]
       mov      eax, dword ptr [rax+4*rdi]
       lea      rsi, G_M21826_IG02
       add      rax, rsi
       jmp      rax

G_M21826_IG03: ....
G_M21826_IG04: ....
G_M21826_IG05: ....
G_M21826_IG06: ....
G_M21826_IG07: ....


RWD00   dd      G_M21826_IG03 - G_M21826_IG02
        dd      G_M21826_IG04 - G_M21826_IG02    ;grade=B?
        dd      G_M21826_IG04 - G_M21826_IG02    ;grade=C?
        dd      G_M21826_IG05 - G_M21826_IG02
        dd      G_M21826_IG07 - G_M21826_IG02
        dd      G_M21826_IG06 - G_M21826_IG02
    


Oh, it doesn't use repetitive CMP for each "switch case".

If you want to look at the full disassembly, click here: https://godbolt.org/z/xbW5GWvcq
Post 19 Aug 2022, 11:44
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
Thrre more meaningless data points. But still can't beat the useless_cmp_chain. What am I doing wrong? I can't find the "obvious" solution here. Sad
Code:
macro cmov_for_the_win {
        cmp     al,1
        mov     edx,item1
        cmovz   edi,edx
        cmp     al,2
        mov     edx,item2
        cmovz   edi,edx
        cmp     al,3
        mov     edx,item3
        cmovz   edi,edx
        cmp     al,4
        mov     edx,item4
        cmovz   edi,edx
        jmp     edi
}

macro LUTs_are_awesome {
        movzx   eax,al
        jmp     dword[LUT+(eax-1)*4]
        align 4
LUT:    dd      item1,item2,item3,item4
}

macro computed_goto {
        movzx   eax,al
        lea     eax,[(eax-1)*(item2-item1)+item1]
        jmp     eax
}    
Code:
~ fasm -d TEST=cmov_for_the_win test.asm && time ./test 
flat assembler  version 1.73.08  (4044352 kilobytes memory)
2 passes, 1165 bytes.

real    0m7.130s
user    0m7.012s
sys     0m0.004s
~ fasm -d TEST=LUTs_are_awesome test.asm && time ./test 
flat assembler  version 1.73.08  (4036084 kilobytes memory)
2 passes, 1151 bytes.

real    0m6.460s
user    0m6.376s
sys     0m0.000s
~ fasm -d TEST=computed_goto test.asm && time ./test 
flat assembler  version 1.73.08  (4045368 kilobytes memory)
3 passes, 1135 bytes.

real    0m6.427s
user    0m6.352s
sys     0m0.000s    
Post 19 Aug 2022, 12:10
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1884
Furs
revolution wrote:
Show your improvement.
I don't know how to use your macro but you're doing too many jumps and no fall-throughs. If I were to code it, I'd do it like:
Code:
    cmp     al,3
    jb      @f
    je      item3
item4:
    jmp     next_item
@@:
    cmp     al,1
    je      item1
item2:
    jmp     next_item    
Anyway as I said, you have too little jump choices, 4 is usually the cutoff. Not sure of the point you're trying to make, they seem more or less equal (if you code binary search better) which is why 4 is the cutoff. So I guess testing the obvious.

BTW I purposefully made it so it takes more jumps for 1-2 case, if I wanted to optimize it for that pattern I'd do it the other way around.
Post 19 Aug 2022, 13:15
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
Thanks. I will pass that on to be tested.
Furs wrote:
... if I wanted to optimize it for that pattern I'd do it the other way around.
So you agree that the data pattern matters for optimisation. Good. I'm glad you saw your error.
Post 19 Aug 2022, 13:28
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3464
Location: vpcmipstrm
bitRAKE
revolution wrote:
Perhaps we should optimise for anti-size, and then get the fastest. Laughing
This happens quite frequently - it is rare for small code to also be fastest*. Tuned specialized cases win the race.

* In the scope of all problem spaces.

_________________
¯\(°_o)/¯ unlicense.org
Post 19 Aug 2022, 13:34
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1884
Furs
revolution wrote:
Thanks. I will pass that on to be tested.
Furs wrote:
... if I wanted to optimize it for that pattern I'd do it the other way around.
So you agree that the data pattern matters for optimisation. Good. I'm glad you saw your error.
Right, I did say it matters if it's extremely skewed.

In fact, I pointed exceptions where binary search is worse (e.g. even said up to 4 to use linear in most cases), but the problem is that if your data is skewed, you should really be using something else like RLE (run-length encoding), where you loop over repeated data, which will be much faster (and take much less space to encode, too).

So, linear search is bad except when data statistical distribution is mostly random, and you have < 5 choices or so (rule of thumb, not exact, but it's around there).
Post 20 Aug 2022, 16:05
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3464
Location: vpcmipstrm
bitRAKE
Henri Poincaré, 1899 wrote:
Logic sometimes makes monsters. For half a century we have seen a mass of bizarre functions which appear to be forced to resemble as little as possible honest functions which serve some purpose. More of continuity, or less of continuity, more derivatives, and so forth. Indeed, from the point of view of logic, these strange functions are the most general; on the other hand those which one meets without searching for them, and which follow simple laws appear as a particular case which does not amount to more than a small corner.

In former times when one invented a new function it was for a practical purpose; today one invents them purposely to show up defects in the reasoning of our fathers and one will deduce from them only that.

If logic were the sole guide of the teacher, it would be necessary to begin with the most general functions, that is to say with the most bizarre. It is the beginner that would have to be set grappling with this teratologic museum.
Unfortunately, we didn't have the gift of excluding pathological cases at the outset. If unstated it seems natural to assume a regular distribution on the input. Knowing that corner cases exist is helpful - especially when our assumption breakdown and we experience the unexpected in testing.

_________________
¯\(°_o)/¯ unlicense.org
Post 20 Aug 2022, 17:26
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
Furs wrote:
... it matters if it's extremely skewed.
That is the same as saying there isn't an objectively better solution.
Post 20 Aug 2022, 22:46
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
My colleague came back to me and she commented:
Quote:
Why does Furs keep moving the goalposts?

Why did Furs ignore the question about when we have no control over the data format, and keep recommending RLE?

Why does Furs say it is objective and then give a subjective value of maybe 4, or 5, or something else?

Why does Furs still insist no testing is required for the "obvious", and then say the difference is "more or less equal" when clearly the 50% difference is enormous? Our boss would be over-the-moon to have efficiency gains like that.

Is Furs even a programmer, or is it all just a troll?
I'm not sure about the last question. Furs?
Post 21 Aug 2022, 02:38
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18837
Location: In your JS exploiting you and your system
revolution
We couldn't decode Furs' statement about "skewed", it would appear to be a subjective assessment, so we tried to normalise the data.
Code:
the_data_pattern_is_unimportant:
        while % <= 250
                db 1,2,3,4
        end while
the_data_pattern_is_unimportant_len = $ - the_data_pattern_is_unimportant    
Results
Code:
~ fasm -d TEST=uselss_cmp_chain test.asm && time ./test
flat assembler  version 1.73.08  (4053804 kilobytes memory)
2 passes, 1139 bytes.

real    0m4.551s
user    0m4.460s
sys     0m0.008s
~ fasm -d TEST=objectively_better_binary test.asm && time ./test
flat assembler  version 1.73.08  (4040788 kilobytes memory)
2 passes, 1137 bytes.

real    0m5.197s
user    0m5.112s
sys     0m0.000s
~ fasm -d TEST=four_way_junction test.asm && time ./test
flat assembler  version 1.73.08  (4037252 kilobytes memory)
2 passes, 1133 bytes.

real    0m7.139s
user    0m7.004s
sys     0m0.000s
~ fasm -d TEST=hack_1234_chain_1 test.asm && time ./test
flat assembler  version 1.73.08  (4022056 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.521s
user    0m6.416s
sys     0m0.000s
~ fasm -d TEST=hack_1234_chain_2 test.asm && time ./test
flat assembler  version 1.73.08  (4053372 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.505s
user    0m6.416s
sys     0m0.000s
~ fasm -d TEST=hack_1234_chain_3 test.asm && time ./test
flat assembler  version 1.73.08  (4028416 kilobytes memory)
2 passes, 1133 bytes.

real    0m6.508s
user    0m6.424s
sys     0m0.004s
~ fasm -d TEST=cmov_for_the_win test.asm && time ./test
flat assembler  version 1.73.08  (4038796 kilobytes memory)
2 passes, 1165 bytes.

real    0m7.195s
user    0m7.016s
sys     0m0.004s
~ fasm -d TEST=LUTs_are_awesome test.asm && time ./test
flat assembler  version 1.73.08  (4025676 kilobytes memory)
2 passes, 1151 bytes.

real    0m6.490s
user    0m6.364s
sys     0m0.004s
~ fasm -d TEST=computed_goto test.asm && time ./test
flat assembler  version 1.73.08  (4038980 kilobytes memory)
3 passes, 1135 bytes.

real    0m6.467s
user    0m6.352s
sys     0m0.004s
~ fasm -d TEST=uber_no_need_to_test_code_from_Furs test.asm && time ./test
flat assembler  version 1.73.08  (4053596 kilobytes memory)
2 passes, 1137 bytes.

real    0m7.146s
user    0m7.008s
sys     0m0.004s    
The last test is this
Code:
macro uber_no_need_to_test_code_from_Furs {
        cmp     al,3
        jb      @f
        je      item3
item4x: jmp     next_item
@@:     cmp     al,1
        je      item1
item2x: jmp     next_item
}    
Furs wrote:
Some things are just obvious.
...
No test needed Wink
Post 21 Aug 2022, 04:01
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 648
Location: Russia
macomics
Could you attach all the macros that were tested with a file?
Post 21 Aug 2022, 06:17
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3, 4, 5, 6  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.