flat assembler
Message board for the users of flat assembler.

Index > Main > 256 possible order. What is the fastest code?

Goto page 1, 2, 3, 4, 5, 6  Next
Author
Thread Post new topic Reply to topic
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes
Code:
; byte in al
cmp al,00h
je La00
cmp al,01h
je La01
....
cmp al,0ffh
je Laff

La00:
do order1
La01:
do order2
Laff:
do order256
    

What is the fastest code? jb, je, ja 080h...?
Post 12 Aug 2022, 03:06
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18846
Location: In your JS exploiting you and your system
revolution
Fastestcodes wrote:
What is the fastest ...?
There is no definitive answer to your question.

It depends upon too many factors. And most of the relevant factors are hidden, unmeasurable and unknowable, and change frequently.
Post 12 Aug 2022, 03:34
ProMiNick



Joined: 24 Mar 2012
Posts: 688
Location: Russian Federation, Sochi
ProMiNick
Code:
cmp     eax, 255  ;movzx eax,al
ja      notjmptbl ;
jmp     dword[jmptbl+eax*4] ; eax is always < 256 here, whether clamped by the compare or zero-extended from al by movzx
jmptbl:
        dd La00
        dd La01
        ...
        dd LaFF
notjmptbl:    
Post 12 Aug 2022, 06:24
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18846
Location: In your JS exploiting you and your system
revolution
For simplicity I would probably go with a basic LUT, as posted above, but it might not be "faster".

If you are really really sure that this code is the major bottleneck in your application then you might find that a mixture of cmp chain and LUT could work well. Use the cmp chain for the most frequent values, and drop back to a LUT for the rest.

Alternatively, a binary search might work well for some data access patterns on some CPU types with good branch history caches.

Sometimes aligning entry and/or exit points can be useful, depending upon your overall L1 cache usage with the rest of the code.

Anyhow, try out all the options and see which works best with your data, on your systems. Occasionally what seems like bad code can work out to be really good. But you won't know until you test each of them in the final app.
Post 12 Aug 2022, 06:49
Furs



Joined: 04 Mar 2016
Posts: 1888
Furs
Fastestcodes wrote:
Code:
; byte in al
cmp al,00h
je La00
cmp al,01h
je La01
....
cmp al,0ffh
je Laff

La00:
do order1
La01:
do order2
Laff:
do order256
    

What is the fastest code? jb, je, ja 080h...?
If you're going with a long jump chain instead of a LUT, always use binary search. It's objectively better unless you have fewer than 4 entries.

This applies to any constant data known at compile time, where you can pre-sort it before the app even runs.
Post 12 Aug 2022, 13:11
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18846
Location: In your JS exploiting you and your system
revolution
Furs wrote:
... always use binary search. It's objectively better unless you have less than 4 entries.
I have a problem with this statement. It isn't objectively better when things like branch prediction exist. The particular data patterns encountered in the app can make such sweeping objective statements invalid.

We can't know until we test it, on real data, in the real app.
Post 12 Aug 2022, 13:32
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes
Fewer cmps?

Code:
cmp al,07fh
jb L007e
ja L80fe
;L7f:

L007e:
cmp al,03fh
jb L003e
ja L407e
;L3f:

L003e:
cmp al,01fh
jb L001e
ja L203e
;L1f:

L001e:
cmp al,00fh
jb L000e
ja L101e
;L0f:

L000e:
cmp al,007h
jb L0006
ja L080e
;L07:

L0006:
cmp al,003h
jb L0002
ja L0406
;L03:

L0002:
cmp al,001h
jb L00
ja L02
;L01
L00:
L02:

L0406:
cmp al,005h
jb L04
ja L06
;L05:
L04:
L06:


L080e:
cmp al,00bh
jb L080a
ja L0c0e
;L0b:

L080a:
cmp al,009h
jb L08
ja L0a
;L09:
L08:
L0a:

L0c0e:
cmp al,00dh
jb L0c
ja L0e
;L0d:
L0c:
L0e:

L101e:
cmp al,017h
jb L1016
ja L181e
;L17:

L1016:
cmp al,013h
jb L1012
ja L1416
;L13:

L1012:
cmp al,011h
jb L10
ja L12
;L11
L10:
L12:

L1416:
cmp al,015h
jb L14
ja L16
;L15:
L14:
L16:


L181e:
cmp al,01bh
jb L181a
ja L1c1e
;L1b:

L181a:
cmp al,019h
jb L18
ja L1a
;L19:
L18:
L1a:

L1c1e:
cmp al,01dh
jb L1c
ja L1e
;L1d:
L1c:
L1e:

L203e:
cmp al,02fh
jb L202e
ja L303e
;L2f:

L202e:
cmp al,27h
jb L2026
ja L282e
;L27:

L2026:
cmp al,023h
jb L2022
ja L2426
;L23:

L2022:
cmp al,021h
jb L20
ja L22
;L21
L20:
L22:

L2426:
cmp al,25h
jb L24
ja L26
;L25:
L24:
L26:


L282e:
cmp al,2bh
jb L282a
ja L2c2e
;L2b:

L282a:
cmp al,29h
jb L28
ja L2a
;L29:
L28:
L2a:

L2c2e:
cmp al,02dh
jb L2c
ja L2e
;L2d:
L2c:
L2e:

L303e:
cmp al,037h
jb L3036
ja L383e
;L37:

L3036:
cmp al,033h
jb L3032
ja L3436
;L33:

L3032:
cmp al,031h
jb L30
ja L32
;L31
L30:
L32:

L3436:
cmp al,035h
jb L34
ja L36
;L35:
L34:
L36:


L383e:
cmp al,03bh
jb L383a
ja L3c3e
;L3b:

L383a:
cmp al,039h
jb L38
ja L3a
;L39:
L38:
L3a:

L3c3e:
cmp al,03dh
jb L3c
ja L3e
;L3d:
L3c:
L3e:


L407e:
cmp al,05fh
jb L405e
ja L607e
;L5f:

L405e:
cmp al,04fh
jb L404e
ja L505e
;L4f:

L404e:
cmp al,047h
jb L4046
ja L484e
;L47:

L4046:
cmp al,043h
jb L4042
ja L4446
;L43:

L4042:
cmp al,041h
jb L40
ja L42
;L41
L40:
L42:

L4446:
cmp al,045h
jb L44
ja L46
;L45:
L44:
L46:


L484e:
cmp al,04bh
jb L484a
ja L4c4e
;L4b:

L484a:
cmp al,49h
jb L48
ja L4a
;L49:
L48:
L4a:

L4c4e:
cmp al,04dh
jb L4c
ja L4e
;L4d:
L4c:
L4e:

L505e:
cmp al,057h
jb L5056
ja L585e
;L57:

L5056:
cmp al,053h
jb L5052
ja L5456
;L53:

L5052:
cmp al,051h
jb L50
ja L52
;L51
L50:
L52:

L5456:
cmp al,055h
jb L54
ja L56
;L55:
L54:
L56:


L585e:
cmp al,05bh
jb L585a
ja L5c5e
;L5b:

L585a:
cmp al,059h
jb L58
ja L5a
;L59:
L58:
L5a:

L5c5e:
cmp al,05dh
jb L5c
ja L5e
;L5d:
L5c:
L5e:

L607e:
cmp al,06fh
jb L606e
ja L707e
;L6f:

L606e:
cmp al,067h
jb L6066
ja L686e
;L67:

L6066:
cmp al,063h
jb L6062
ja L6466
;L63:

L6062:
cmp al,061h
jb L60
ja L62
;L61
L60:
L62:

L6466:
cmp al,065h
jb L64
ja L66
;L65:
L64:
L66:


L686e:
cmp al,06bh
jb L686a
ja L6c6e
;L6b:

L686a:
cmp al,069h
jb L68
ja L6a
;L69:
L68:
L6a:

L6c6e:
cmp al,06dh
jb L6c
ja L6e
;L6d:
L6c:
L6e:

L707e:
cmp al,077h
jb L7076
ja L787e
;L77:

L7076:
cmp al,073h
jb L7072
ja L7476
;L73:

L7072:
cmp al,071h
jb L70
ja L72
;L71
L70:
L72:

L7476:
cmp al,075h
jb L74
ja L76
;L75:
L74:
L76:


L787e:
cmp al,07bh
jb L787a
ja L7c7e
;L7b:

L787a:
cmp al,079h
jb L78
ja L7a
;L79:
L78:
L7a:

L7c7e:
cmp al,07dh
jb L7c
ja L7e
;L7d:
L7c:
L7e:




L80fe:
cmp al,0bfh
jb L80be
ja Lc0fe
;Lbf:

L80be:
cmp al,09fh
jb L809e
ja La0be
;L9f:

L809e:
cmp al,08fh
jb L808e
ja L909e
;L8f:

L808e:
cmp al,087h
jb L8086
ja L888e
;L87:

L8086:
cmp al,083h
jb L8082
ja L8486
;L83:

L8082:
cmp al,081h
jb L80
ja L82
;L81
L80:
L82:

L8486:
cmp al,085h
jb L84
ja L86
;L85:
L84:
L86:


L888e:
cmp al,08bh
jb L888a
ja L8c8e
;L8b:

L888a:
cmp al,089h
jb L88
ja L8a
;L89:
L88:
L8a:

L8c8e:
cmp al,08dh
jb L8c
ja L8e
;L8d:
L8c:
L8e:


L909e:
cmp al,097h
jb L9096
ja L989e
;L97:

L9096:
cmp al,093h
jb L9092
ja L9496
;L93:

L9092:
cmp al,091h
jb L90
ja L92
;L91
L90:
L92:

L9496:
cmp al,095h
jb L94
ja L96
;L95:
L94:
L96:


L989e:
cmp al,09bh
jb L989a
ja L9c9e
;L9b:

L989a:
cmp al,099h
jb L98
ja L9a
;L99:
L98:
L9a:

L9c9e:
cmp al,09dh
jb L9c
ja L9e
;L9d:
L9c:
L9e:

La0be:
cmp al,0afh
jb La0ae
ja Lb0be
;Laf:

La0ae:
cmp al,0a7h
jb La0a6
ja La8ae
;La7:

La0a6:
cmp al,0a3h
jb La0a2
ja La4a6
;La3:

La0a2:
cmp al,0a1h
jb La0
ja La2
;La1
La0:
La2:

La4a6:
cmp al,0a5h
jb La4
ja La6
;La5:
La4:
La6:


La8ae:
cmp al,0abh
jb La8aa
ja Lacae
;Lab:

La8aa:
cmp al,0a9h
jb La8
ja Laa
;La9:
La8:
Laa:

Lacae:
cmp al,0adh
jb Lac
ja Lae
;Lad:
Lac:
Lae:

Lb0be:
cmp al,0b7h
jb Lb0b6
ja Lb8be
;Lb7:

Lb0b6:
cmp al,0b3h
jb Lb0b2
ja Lb4b6
;Lb3:

Lb0b2:
cmp al,0b1h
jb Lb0
ja Lb2
;Lb1
Lb0:
Lb2:

Lb4b6:
cmp al,0b5h
jb Lb4
ja Lb6
;Lb5:
Lb4:
Lb6:


Lb8be:
cmp al,0bbh
jb Lb8ba
ja Lbcbe
;Lbb:

Lb8ba:
cmp al,0b9h
jb Lb8
ja Lba
;Lb9:
Lb8:
Lba:

Lbcbe:
cmp al,0bdh
jb Lbc
ja Lbe
;Lbd:
Lbc:
Lbe:


Lc0fe:
cmp al,0dfh
jb Lc0de
ja Le0fe
;Ldf:

Lc0de:
cmp al,0cfh
jb Lc0ce
ja Ld0de
;Lcf:

Lc0ce:
cmp al,0c7h
jb Lc0c6
ja Lc8ce
;Lc7:

Lc0c6:
cmp al,0c3h
jb Lc0c2
ja Lc4c6
;Lc3:

Lc0c2:
cmp al,0c1h
jb Lc0
ja Lc2
;Lc1
Lc0:
Lc2:

Lc4c6:
cmp al,0c5h
jb Lc4
ja Lc6
;Lc5:
Lc4:
Lc6:


Lc8ce:
cmp al,0cbh
jb Lc8ca
ja Lccce
;Lcb:

Lc8ca:
cmp al,0c9h
jb Lc8
ja Lca
;Lc9:
Lc8:
Lca:

Lccce:
cmp al,0cdh
jb Lcc
ja Lce
;Lcd:
Lcc:
Lce:

Ld0de:
cmp al,0d7h
jb Ld0d6
ja Ld8de
;Ld7:

Ld0d6:
cmp al,0d3h
jb Ld0d2
ja Ld4d6
;Ld3:

Ld0d2:
cmp al,0d1h
jb Ld0
ja Ld2
;Ld1
Ld0:
Ld2:

Ld4d6:
cmp al,0d5h
jb Ld4
ja Ld6
;Ld5:
Ld4:
Ld6:


Ld8de:
cmp al,0dbh
jb Ld8da
ja Ldcde
;Ldb:

Ld8da:
cmp al,0d9h
jb Ld8
ja Lda
;Ld9:
Ld8:
Lda:

Ldcde:
cmp al,0ddh
jb Ldc
ja Lde
;Ldd:
Ldc:
Lde:

Le0fe:
cmp al,0efh
jb Le0ee
ja Lf0fe
;Lef:

Le0ee:
cmp al,0e7h
jb Le0e6
ja Le8ee
;Le7:

Le0e6:
cmp al,0e3h
jb Le0e2
ja Le4e6
;Le3:

Le0e2:
cmp al,0e1h
jb Le0
ja Le2
;Le1
Le0:
Le2:

Le4e6:
cmp al,0e5h
jb Le4
ja Le6
;Le5:
Le4:
Le6:


Le8ee:
cmp al,0ebh
jb Le8ea
ja Lecee
;Leb:

Le8ea:
cmp al,0e9h
jb Le8
ja Lea
;Le9:
Le8:
Lea:

Lecee:
cmp al,0edh
jb Lec
ja Lee
;Led:
Lec:
Lee:

Lf0fe:
cmp al,0f7h
jb Lf0f6
ja Lf8fe
;Lf7:

Lf0f6:
cmp al,0f3h
jb Lf0f2
ja Lf4f6
;Lf3:

Lf0f2:
cmp al,0f1h
jb Lf0
ja Lf2
;Lf1
Lf0:
Lf2:

Lf4f6:
cmp al,0f5h
jb Lf4
ja Lf6
;Lf5:
Lf4:
Lf6:


Lf8fe:
cmp al,0fbh
jb Lf8fa
ja Lfcfe
;Lfb:

Lf8fa:
cmp al,0f9h
jb Lf8
ja Lfa
;Lf9:
Lf8:
Lfa:

Lfcfe:
cmp al,0fdh
jb Lfc
ja Lfe
;Lfd:
Lfc:
Lfe:
cmp al,0ffh
je Lff

Lff:
    
Post 12 Aug 2022, 13:37
bitRAKE



Joined: 21 Jul 2003
Posts: 3489
Location: vpcmipstrm
bitRAKE
Some processors have a problem predicting too many jumps in a short span of code.

Large jump tables can be a problem for the L1 data & code caches. Yet this really depends on the overall code - maybe it all fits in the cache, maybe the execution path density is very sparse?

Here is a trick:
Code:
lea rbx,[JMP_00] ; a global register, outside of loop?
...
mov bh,al
jmp rbx ; or CALL RBX

...

align 1 shl 16 ; a lesser condition is sufficient
JMP_00:
...
; 256 bytes for each jump target
...
align 256
JMP_FF:    
... this trick (and similar ones) has actually been used in many emulators. [I first saw it implemented in a Z80 emulator in 68k asm some 30+ years ago.] We get the benefit of code alignment and no data cache pressure = it is fast.

This method has a baby brother.

The emulators moved away from individual dispatch when dynamic re-compilers became popular - higher latency, but greater throughput.

[edit] I remembered an earlier reference, Steve "Woz" Wozniak's SWEET16 used both a data table and code aligned in memory for virtual instruction dispatch.

_________________
¯\(°_o)/¯ unlicense.org
Post 12 Aug 2022, 14:19
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes
"lea rbx,[JMP_00]" ; if JMP_00 is at 0000000012243648h
...
"mov bh,al" ; then rbx will range from 0000000012240048h to 000000001224ff48h
Why "align 1 shl 16" and "align 256"? Your code seems short and fast.
Post 13 Aug 2022, 05:29
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes
bitRAKE wrote:
Some processors have a problem predicting too many jumps in a short span of code.

Large jump tables can be a problem with L1 data & code cache. Yet, this really depends on the overall code - maybe it all fits in the cache, maybe the execution paths density is very sparse?

Here is a trick:
Code:
lea rbx,[JMP_00] ; a global register, outside of loop?
...
mov bh,al
jmp rbx ; or CALL RBX

...

align 1 shl 16 ; a lesser condition is sufficient
JMP_00:
...
; 256 bytes for each jump target
...
align 256
JMP_FF:    
... this trick (and similar) has actually been used in many emulator codes. [I first saw it implemented in a Z80 emulator in 68k asm some 30+ years ago.] We have the benefit of code alignment and no data cache pressure = it is fast.

This method has a baby brother.

The emulators moved away from individual dispatch when dynamic re-compilers became popular - higher latency, but greater throughput.

[edit] I remembered an earlier reference, Steve "Woz" Wozniak's SWEET16 used both a data table and code aligned in memory for virtual instruction dispatch.


How to modify this code for 65536 orders?
Post 13 Aug 2022, 08:06
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
Code:
; ax - select, cl - align (N)
lea rbx, [JMP_0000]
ror rbx, cl
mov bx, ax
rol rbx, cl
jmp rbx

...

align 1 shl (16 + N)
JMP_0000:
...
; 2^N bytes for each jump target
...
align 1 shl N
JMP_FFFF:
...    
Post 13 Aug 2022, 08:12
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes
Thx all.
Post 13 Aug 2022, 09:39
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
With N = 8, just the block of such labels from JMP_0000 to JMP_FFFF will take about 16 MB!
Post 13 Aug 2022, 10:10
Fastestcodes



Joined: 13 Jun 2022
Posts: 75
Fastestcodes
macomics wrote:
With N = 8, only a block of such labels from JMP_0000 to JMP_FFFF will take about 16 Mb!

Yes, and only 65536 orders. Not too difficult a machine. Artificial Unintelligence. Smile
Post 13 Aug 2022, 11:21
bitRAKE



Joined: 21 Jul 2003
Posts: 3489
Location: vpcmipstrm
bitRAKE
Fastestcodes wrote:
"lea rbx,[JMP_00] ; If the JMP_00 0000000012243648h
...
mov bh,al ; rbx will be 0000000012240048 to 000000001224ff48
Why "align 1 shl 16" and "align 256"? Your code seems short and fast.
ALIGN 0x10000 would ensure the low word is zero. Under Windows we can predict the low word of the address at assemble-time. Try it. Even with relocation there are rules forcing the low word.

Another thing we can examine is the value of BL. Testing will show that code alignment greater than 16 does very little - maybe it saves a cacheline. So the value of BL has several options, as long as the relative offset between branch targets is 256.


On the 65536 case ...
We could be super ambitious and maybe get away with N=3, i.e. 512k. That leaves 64k cases to optimize/automate. Confused At this point all code caching has been nullified. There are probably better solutions. I have been known to slap together wonky solutions - this type of thing is not beyond me.

_________________
¯\(°_o)/¯ unlicense.org
Post 13 Aug 2022, 11:38
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18846
Location: In your JS exploiting you and your system
revolution
I can't imagine any real world code that needs a 64k-wide branch selection.

It would be awesome fun to write, and debug, and test, and verify all of those 64k individual branches of code. One could complete 10 per day and be done in just under 18 years. Laughing Good luck.
Post 13 Aug 2022, 11:58
edfed



Joined: 20 Feb 2006
Posts: 4272
Location: Now
edfed
A look-up table is the best choice, but not the only one: you can add as many levels of indirection as you want.


For example, if you have a function table, and each function entry is a structure with a list of versions (1.0, coder1, coder2 ...), modes (debug, sandbox ...), and so on...

Code:
mov eax,[functable+eax*4]   ; level 1: function entry
mov eax,[eax+functionmode]  ; level 2: selected mode
mov ebx,[functionversion]
mov eax,[eax+ebx]           ; level 3: selected version
call [eax]
    



If you make a switch/case it gets ugly with many entries, but it is very convenient when you want to code fast and easy, like DispatchMessage in a Windows process.
Post 13 Aug 2022, 15:25
Overclick



Joined: 11 Jul 2020
Posts: 577
Location: Ukraine
Overclick
Also, some of this can be done with parallel calculations (SSE) instead of that number of labels.
I do that for 256 virtual channels with 480 factor variables, for peak cutters and a normalize algorithm.
Post 13 Aug 2022, 16:11
Furs



Joined: 04 Mar 2016
Posts: 1888
Furs
revolution wrote:
Furs wrote:
... always use binary search. It's objectively better unless you have less than 4 entries.
I have a problem with this statement. It isn't objectively better when things like branch prediction exist. The particular data patterns encountered in the app can make such sweeping objective statements invalid.

We can't know until we test it, on real data, in the real app.
But it takes far fewer branches, so it's far easier to predict?

Really, the only time it would not be better is if 90% or so of your data hits the first or second value checked linearly.

And even in that case I doubt it's worse, because branch prediction, which you mentioned, actually helps binary search more, rather than working against it.
Post 13 Aug 2022, 17:11
macomics



Joined: 26 Jan 2021
Posts: 652
Location: Russia
macomics
Furs wrote:
Furs wrote:
... always use binary search.

But it takes far fewer branches, so it's far easier to predict?

Here's a simple task for you to think about. You are given 2 identical glass balls and a skyscraper with N floors. It is known that a ball definitely breaks when thrown from the top floor, and that both balls behave identically when thrown from the same floor. Using the minimum number of throws, find the lowest floor from which a ball breaks.

Solving this problem may make it clear why C/Pascal compilers do not generate a large binary tree of jumps when compiling a switch/case.


Last edited by macomics on 13 Aug 2022, 21:54; edited 1 time in total
Post 13 Aug 2022, 17:56


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.