flat assembler
Message board for the users of flat assembler.
![]() Goto page 1, 2, 3, 4, 5, 6 Next |
Author |
|
revolution 12 Aug 2022, 03:34
Fastestcodes wrote: What is the fastest ...? It depends upon too many factors. And most of the relevant factors are hidden, unmeasurable and unknowable, and change frequently. |
|||
![]() |
|
ProMiNick 12 Aug 2022, 06:24
Code: cmp eax, 255 ;movzx eax,al ja notjmptbl ; jmp dword[jmptbl+eax*4]; here eax is always <256, no matter filtered eax by compare or zeroed high part out of al jmptbl: dd La00 dd La01 ... dd LaFF notjmptbl: |
|||
![]() |
|
revolution 12 Aug 2022, 06:49
For simplicity I would probably go with a basic LUT, as posted above, but it might not be "faster".
If you are really really sure that this code is the major bottleneck in your application then you might find that a mixture of cmp chain and LUT could work well. Use the cmp chain for the most frequent values, and drop back to a LUT for the rest. Alternatively, a binary search might work well for some data access patterns on some CPU types with good branch history caches. Sometimes aligning entry and/or exit points can be useful, depending upon your overall L1 cache usage with the rest of the code. Anyhow, try out all the options and see which works best with your data, on your systems. Occasionally what seems like bad code can work out to be really good. But you won't know until you test each of them in the final app. |
|||
![]() |
|
Furs 12 Aug 2022, 13:11
Fastestcodes wrote:
This applies to all constant data known at compile time where you can pre-sort it ahead before even running the app. |
|||
![]() |
|
revolution 12 Aug 2022, 13:32
Furs wrote: ... always use binary search. It's objectively better unless you have less than 4 entries. We can't know until we test it, on real data, in the real app. |
|||
![]() |
|
Fastestcodes 12 Aug 2022, 13:37
Less cmp?
Code: cmp al,07fh jb L007e ja L80fe ;L7f: L007e: cmp al,03fh jb L003e ja L407e ;L3f: L003e: cmp al,01fh jb L001e ja L203e ;L1f: L001e: cmp al,00fh jb L000e ja L101e ;L0f: L000e: cmp al,007h jb L0006 ja L080e ;L07: L0006: cmp al,003h jb L0002 ja L0406 ;L03: L0002: cmp al,001h jb L00 ja L02 ;L01 L00: L02: L0406: cmp al,005h jb L04 ja L06 ;L05: L04: L06: L080e: cmp al,00bh jb L080a ja L0c0e ;L0b: L080a: cmp al,009h jb L08 ja L0a ;L09: L08: L0a: L0c0e: cmp al,00dh jb L0c ja L0e ;L0d: L0c: L0e: L101e: cmp al,017h jb L1016 ja L181e ;L17: L1016: cmp al,013h jb L1012 ja L1416 ;L13: L1012: cmp al,011h jb L10 ja L12 ;L11 L10: L12: L1416: cmp al,015h jb L14 ja L16 ;L15: L14: L16: L181e: cmp al,01bh jb L181a ja L1c1e ;L1b: L181a: cmp al,019h jb L18 ja L1a ;L19: L18: L01a: L1c1e: cmp al,01dh jb L1c ja L1e ;L1d: L1c: L1e: L203e: cmp al,02fh jb L202e ja L303e ;L2f: L202e: cmp al,27h jb L2026 ja L282e ;L27: L2026: cmp al,023h jb L2022 ja L2426 ;L23: L2022: cmp al,021h jb L20 ja L22 ;L21 L20: L22: L2426: cmp al,25h jb L24 ja L26 ;L25: L24: L26: L282e: cmp al,2bh jb L282a ja L2c2e ;L2b: L282a: cmp al,29h jb L28 ja L2a ;L29: L28: L2a: L2c2e: cmp al,02dh jb L2c ja L2e ;L2d: L2c: L2e: L303e: cmp al,037h jb L3036 ja L383e ;L37: L3036: cmp al,033h jb L3032 ja L3436 ;L33: L3032: cmp al,031h jb L30 ja L32 ;L31 L30: L32: L3436: cmp al,035h jb L34 ja L36 ;L35: L34: L36: L383e: cmp al,03bh jb L383a ja L3c3e ;L3b: L383a: cmp al,039h jb L38 ja L3a ;L39: L38: L3a: L3c3e: cmp al,03dh jb L03c ja L03e ;L3d: L3c: L3e: L407e: cmp al,05fh jb L405e ja L607e ;L5f: L405e: cmp al,04fh jb L404e ja L505e ;L4f: L404e: cmp al,047h jb L4046 ja L484e ;L47: L4046: cmp al,043h jb L4042 ja L4446 ;L43: L4042: cmp al,041h jb L40 ja L42 ;L41 L40: L42: L4446: cmp al,045h jb L44 ja L46 ;L45: L44: L46: L484e: cmp al,04bh jb L484a ja L4c4e ;L4b: L484a: cmp al,49h jb L48 ja L4a ;L49: L48: L4a: L4c4e: cmp al,04dh jb L4c ja L4e ;L4d: L4c: L4e: L505e: cmp al,057h jb L5056 ja L585e ;L57: L5056: cmp al,053h jb L5052 ja L5456 ;L53: L5052: cmp al,051h jb L50 ja L52 ;L51 L50: L52: L5456: cmp al,055h jb L54 ja L56 ;L55: L54: L56: L585e: cmp al,05bh jb L585a ja L5c5e ;L5b: L585a: cmp al,059h jb L58 ja L5a ;L59: L58: L5a: L5c5e: cmp al,05dh jb L5c ja L5e ;L5d: L5c: L5e: L607e: cmp al,06fh jb L606e ja L707e ;L6f: L606e: cmp al,067h jb L6066 ja L686e ;L67: L6066: cmp al,063h jb L6062 ja L6466 ;L63: L6062: cmp al,061h jb L60 ja L62 ;L61 L60: L62: L6466: cmp al,065h jb L64 ja L66 ;L65: L64: L66: L686e: cmp al,06bh jb L686a ja L6c6e ;L6b: L686a: cmp al,069h jb L68 ja L6a ;L69: L68: L6a: L6c6e: cmp al,06dh jb L6c ja L6e ;L6d: L6c: L6e: L707e: cmp al,077h jb L7076 ja L787e ;L77: L7076: cmp al,073h jb L7072 ja L7476 ;L73: L7072: cmp al,071h jb L70 ja L72 ;L71 L70: L72: L7476: cmp al,075h jb L74 ja L76 ;L75: L74: L76: L787e: cmp al,07bh jb L787a ja L7c7e ;L7b: L787a: cmp al,079h jb L78 ja L7a ;L79: L78: L7a: L7c7e: cmp al,07dh jb L7c ja L7e ;L7d: L7c: L7e: L80fe: cmp al,0bfh jb L80be ja Lc0fe ;Lbf: L80be: cmp al,09fh jb L809e ja La0fe ;L9f: L809e: cmp al,08fh jb L808e ja L909e ;L8f: L808e: cmp al,087h jb L8086 ja L888e ;L87: L8086: cmp al,083h jb L8082 ja L8486 ;L83: L8082: cmp al,081h jb L80 ja L82 ;L81 L80: L82: L8486: cmp al,085h jb L84 ja L86 ;L85: L84: L86: L888e: cmp al,08bh jb L888a ja L8c8e ;L8b: L888a: cmp al,089h jb L88 ja L8a ;L89: L88: L8a: L8c8e: cmp al,08dh jb L8c ja L8e ;L8d: L8c: L8e: L909e: cmp al,097h jb L9096 ja L989e ;L97: L9096: cmp al,093h jb L9092 ja L9496 ;L93: L9092: cmp al,091h jb L90 ja L92 ;L91 L90: L92: L9496: cmp al,095h jb L94 ja L96 ;L95: L94: L96: L989e: cmp al,09bh jb L989a ja L9c9e ;L9b: L989a: cmp al,099h jb L98 ja L9a ;L99: L98: L9a: L9c9e: cmp al,09dh jb L9c ja L9e ;L9d: L9c: L9e: La0be: cmp al,0afh jb La0ae ja Lb0be ;Laf: La0ae: cmp al,0a7h jb La0a6 ja La8ae ;La7: La0a6: cmp al,0a3h jb La0a2 ja La4a6 ;La3: La0a2: cmp al,0a1h jb La0 ja La2 ;La1 La0: La2: La4a6: cmp al,0a5h jb La4 ja La6 ;La5: La4: La6: La8ae: cmp al,0abh jb La8aa ja Lacae ;Lab: La8aa: cmp al,0a9h jb La8 ja Laa ;La9: La8: Laa: Lacae: cmp al,adh jb Lac ja Lae ;Lad: Lac: Lae: Lb0be: cmp al,0b7h jb Lb0b6 ja Lb8be ;Lb7: Lb0b6: cmp al,0b3h jb Lb0b2 ja Lb4b6 ;Lb3: Lb0b2: cmp al,0b1h jb Lb0 ja Lb2 ;Lb1 Lb0: Lb2: Lb4b6: cmp al,0b5h jb Lb4 ja Lb6 ;Lb5: Lb4: Lb6: Lb8be: cmp al,0bbh jb Lb8ba ja Lbcbe ;Lbb: Lb8ba: cmp al,0b9h jb L0b8 ja L0ba ;Lb9: Lb8: Lba: Lbcbe: cmp al,0bdh jb L0bc ja L0be ;Lbd: Lbc: Lbe: Lc0fe: cmp al,0dfh jb Lc0de ja Le0fe ;Ldf: Lc0de: cmp al,0cfh jb Lc0ce ja Ld0de ;Lcf: Lc0ce: cmp al,0c7h jb Lc0c6 ja Lc8ce ;Lc7: Lc0c6: cmp al,0c3h jb Lc0c2 ja Lc4c6 ;Lc3: Lc0c2: cmp al,0c1h jb Lc0 ja Lc2 ;Lc1 Lc0: Lc2: Lc4c6: cmp al,0c5h jb Lc4 ja Lc6 ;Lc5: Lc4: Lc6: Lc8ce: cmp al,0cbh jb Lc8ca ja Lccce ;Lcb: Lc8ca: cmp al,c9h jb Lc8 ja Lca ;Lc9: Lc8: Lca: Lccce: cmp al,0cdh jb Lcc ja Lce ;Lcd: Lcc: Lce: Ld0de: cmp al,0d7h jb Ld0d6 ja Ld8de ;Ld7: Ld0d6: cmp al,0d3h jb Ld0d2 ja Ld4d6 ;Ld3: Ld0d2: cmp al,0d1h jb Ld0 ja Ld2 ;Ld1 Ld0: Ld2: Ld4d6: cmp al,0d5h jb Ld4 ja Ld6 ;Ld5: Ld4: Ld6: Ld8de: cmp al,0dbh jb Ld8da ja Ldcde ;Ldb: Ld8da: cmp al,0d9h jb Ld8 ja Lda ;Ld9: Ld8: Lda: Ldcde: cmp al,0ddh jb Ldc ja Lde ;Ldd: Ldc: Lde: Le0fe: cmp al,0efh jb Le0ee ja Lf0fe ;Lef: L6 e0ee: cmp al,0e7h jb Le0e6 ja Le8ee ;Le7: Le0e6: cmp al,0e3h jb Le0e2 ja Le4e6 ;Le3: Le0e2: cmp al,0e1h jb Le0 ja Le2 ;Le1 Le0: Le2: Le4e6: cmp al,0e5h jb Le4 ja Le6 ;Le5: Le4: Le6: Le8ee: cmp al,0ebh jb Le8ea ja Lecee ;Leb: Le8ea: cmp al,0e9h jb Le8 ja Lea ;Le9: Le8: Lea: Lecee: cmp al,0edh jb Lec ja Lee ;Led: Lec: Lee: Lf0fe: cmp al,0f7h jb Lf0f6 ja Lf8fe ;Lf7: Lf0f6: cmp al,0f3h jb Lf0f2 ja Lf4f6 ;Lf3: Lf0f2: cmp al,0f1h jb Lf0 ja Lf2 ;Lf1 Lf0: Lf2: Lf4f6: cmp al,0f5h jb Lf4 ja Lf6 ;Lf5: Lf4: Lf6: Lf8fe: cmp al,0fbh jb Lf8fa ja Lfcfe ;Lfb: Lf8fa: cmp al,0f9h jb Lf8 ja Lfa ;Lf9: Lf8: Lfa: Lfcfe: cmp al,0fdh jb Lfc ja Lfe ;Lfd: Lfc: Lfe: cmp al,0ffh je Lff Lff: |
|||
![]() |
|
bitRAKE 12 Aug 2022, 14:19
Some processors have a problem predicting too many jumps in a short span of code.
Large jump tables can be a problem with L1 data & code cache. Yet, this really depends on the overall code - maybe it all fits in the cache, maybe the execution paths density is very sparse? Here is a trick: Code: lea rbx,[JMP_00] ; a global register, outside of loop? ... mov bh,al jmp rbx ; or CALL RBX ... align 1 shl 16 ; a lesser condition is sufficient JMP_00: ... ; 256 bytes for each jump target ... align 256 JMP_FF: This method has a baby brother. The emulators moved away from individual dispatch when dynamic re-compilers became popular - higher latency, but greater throughput. [edit] I remembered an earlier reference, Steve "Woz" Wozniak's SWEET16 used both a data table and code aligned in memory for virtual instruction dispatch. |
|||
![]() |
|
Fastestcodes 13 Aug 2022, 05:29
"lea rbx,[JMP_00] ; If the JMP_00 0000000012243648h
... mov bh,al ; rbx will be 0000000012240048 to 000000001224ff48 Why "align 1 shl 16" and "align 256"? Your code seems short and fast. |
|||
![]() |
|
Fastestcodes 13 Aug 2022, 08:06
bitRAKE wrote: Some processors have a problem predicting too many jumps in a short span of code. How mod this code for 65536 orders? |
|||
![]() |
|
macomics 13 Aug 2022, 08:12
Code: ; ax - select, cl - align (N) lea rbx, [JMP_0000] ror rbx, cl mov bx, ax rol rbx, cl jmp rbx ... align 1 shl (16 + N) JMP_0000: ... ; 2^N bytes for each jump target ... align N JMP_FFFF: ... |
|||
![]() |
|
Fastestcodes 13 Aug 2022, 09:39
Thx all.
|
|||
![]() |
|
macomics 13 Aug 2022, 10:10
With N = 8, only a block of such labels from JMP_0000 to JMP_FFFF will take about 16 Mb!
|
|||
![]() |
|
Fastestcodes 13 Aug 2022, 11:21
macomics wrote: With N = 8, only a block of such labels from JMP_0000 to JMP_FFFF will take about 16 Mb! Yes, and 65536 order only. Not a too difficult machnine. Artifical Unintelligence. ![]() |
|||
![]() |
|
bitRAKE 13 Aug 2022, 11:38
Fastestcodes wrote: "lea rbx,[JMP_00] ; If the JMP_00 0000000012243648h Another thing we can examine is value of BL. Testing will show that code alignment greater than 16 does very little - maybe, save a cacheline. So, the value of BL has several options - as long as relative offset of branches is 256. On the 65536 case ... We could be super ambitious and maybe pass with N=3, 512k. One has 64k cases to optimize/automate. ![]() |
|||
![]() |
|
revolution 13 Aug 2022, 11:58
I can't imagine any real world code that needs a 64k-wide branch selection.
It would be awesome fun to have write, and debug, and test, and verify all of those individual 64k branches of code. One could complete 10 per day and it would be done in just under 18 years. ![]() |
|||
![]() |
|
edfed 13 Aug 2022, 15:25
look up table is the best choice
but not only, you can add as many indiretion as you want for example if you have a function table, and each function entry is a structure with a list of versions(1.0, coder1, coder2 ...), modes( debug,sandbox...) , and so on... Code: mov eax,[functable+eax*4] mov eax,[eax+functionmode] mov ebx,[functionversion] mov eax,[eax+ebx] call [eax] if you make a switchcase, it is ugly when you get many entries but very convenient when you want to code fast and easy like dispatchmessage in windows process. |
|||
![]() |
|
Overclick 13 Aug 2022, 16:11
Also it can be some parallel calculations instead of that number of labels (SSE).
I do that for 256 virtual channels by 480 factor variables peak cutters and normalize algorithm. |
|||
![]() |
|
Furs 13 Aug 2022, 17:11
revolution wrote:
Really the only time it would not be better is if 90% of your data is in the first or second value checked linearly or so. And even in that case I doubt it's worse, because what you mentioned, branch prediction, actually helps binary search more, not goes against it. |
|||
![]() |
|
macomics 13 Aug 2022, 17:56
Furs wrote:
Here's a simple task for you to think about. Given 2 identical glass balls and a skyscraper on N floors. It is known for sure that the ball definitely breaks when thrown from the last floor, and also that both balls break when thrown from the same floor. Find the minimum floor for the minimum number of throws from which the ball will break when thrown? Solving this problem, it may become clear to you why C/Pascal compilers do not generate a large number of jumps by creating a binary tree when generating a switch/case. Last edited by macomics on 13 Aug 2022, 21:54; edited 1 time in total |
|||
![]() |
|
Goto page 1, 2, 3, 4, 5, 6 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.