flat assembler
Message board for the users of flat assembler.
Main > PIE friendly indexed jump table
sylware 23 Aug 2021, 23:22
I would favor a reasonably "fast" version: the code should not become an obvious bottleneck because I missed something that is critical and obvious to experienced coders.
I know I will have to deal with the performance counters at some point, using realistic datasets.
bitRAKE 24 Aug 2021, 01:05
Here is one possibility - just to get things working ...
Code:
struc (name) MakeFunctions
    match =?end?, name          ; note: last function doesn't need a tail
        restruc MakeFunctions   ; only one array of functions
        PreviousFunction reequ name
    else
        if ~ definite FirstFunction
            FirstFunction := 1
            FunctionMax = 0
            postpone
                match =?end?, PreviousFunction
                else match any, PreviousFunction
                    err 10,'"?End MakeFunctions" required to terminate function list (',`any,' was last function)'
                end match
                FunctionSize := FunctionMax
            end postpone
            macro CallFunction value*
                imul eax, value, FunctionSize
                lea rcx, [name]  ; RIP relative
                add rax, rcx
                call rax
            end macro
        else
            .code_bytes := $ - PreviousFunction
            if FunctionMax < .code_bytes
                FunctionMax = .code_bytes
            end if
            rb FunctionSize - .code_bytes
        end if
        PreviousFunction reequ name
        name:
    end match
end struc

It can be used like:

Code:
; just being silly creating different sized functions
Grape MakeFunctions
    mov eax, 0
    retn
Molly MakeFunctions
    xor qword [rdx*8+rbx+0x1234567], -0x01234568
    push 1
    pop rax
    retn
Jovial MakeFunctions
    retn
Yellow MakeFunctions
    mov ax, 3
    retn
Bacon MakeFunctions
    mov al, 4
    retn
?End MakeFunctions

; functions can be called directly from internal code:
    call Yellow

; or they can be called dynamically by index:
YELLOW := 3
fNum dd YELLOW
    CallFunction [fNum]

Could add a feature to generate exported function numbers and labels - maybe later.

_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
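A portable-C sketch of the dispatch the macro implements may help. In the assembly version every function is padded to the size of the largest one (FunctionSize), so CallFunction just computes base + index*FunctionSize with a RIP-relative lea and needs no pointer table. C cannot place functions at a fixed stride, so the nearest analogue is an indexed table of function pointers; all names below are illustrative, not part of the macro:

```c
#include <stddef.h>

/* Stand-ins for the five example functions; each returns its own index. */
static int grape(void)  { return 0; }
static int molly(void)  { return 1; }
static int jovial(void) { return 2; }
static int yellow(void) { return 3; }
static int bacon(void)  { return 4; }

/* In a PIE this table needs a load-time relocation per absolute pointer;
 * the fixed-stride trick in the macro avoids both those relocations and
 * the data-cache load of the table entry. */
static int (*const functions[])(void) = { grape, molly, jovial, yellow, bacon };

/* Mirrors the CallFunction macro: turn an index into a call. */
static int call_function(size_t index)
{
    return functions[index]();
}
```

The assembly version trades the table load for padding bytes between functions, which is exactly the space-versus-indirection question discussed below.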
sylware 24 Aug 2021, 12:25
Thanks for the help, but this seems like real overkill! We'll end up writing a compiler in fasmg macro/CALM syntax (no offence intended).
It looks like I have to choose between:
- one data cache line miss with one jump, with the whole code in one 16-byte paragraph, or
- one instruction cache miss with two jumps, with the code spread over several 16-byte paragraphs (it seems better to have the jump at the beginning of a 16-byte paragraph, and to use as few paragraphs as possible).
Is there an experienced coder who knows the least-bad answer? Naively I would favor the first one.
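The two options can be mimicked with plain data in C, which at least makes the trade-off concrete. This is a hedged illustration with invented names, not anyone's actual code: option 1 is an offset table (position-independent, but one extra load), option 2 is fixed-stride padding (pure arithmetic, but wasted space):

```c
#include <stdint.h>

/* Option 1: variable-sized items reached through a table of 32-bit
 * offsets. Offsets are PIE-friendly (nothing to relocate), but each
 * lookup costs an extra data load -- the potential data cache miss. */
static const char blob[] = "ab\0cdef\0g";      /* three packed items */
static const uint32_t offsets[] = { 0, 3, 8 };

static const char *find_via_table(int i) { return blob + offsets[i]; }

/* Option 2: every item padded to the largest size, so the address is
 * base + i*STRIDE with no table load at all -- at the cost of gaps. */
#define STRIDE 5
static const char padded[] = "ab\0\0\0cdef\0g\0\0\0";

static const char *find_via_stride(int i) { return padded + i * STRIDE; }
```

Both lookups land on the same items; which one is cheaper on real hardware is what the performance counters have to decide.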
revolution 24 Aug 2021, 13:50
sylware wrote: Is there an experienced coder who knows the least-bad answer?

Different systems/CPUs/components/OSes all behave differently. You'll just have to measure.
sylware 24 Aug 2021, 15:20
I know; that's why I mentioned the performance counters. But there are obvious things to do or to avoid, for instance aligned vs. unaligned data access,
and experienced coders may know something about more complex patterns like this one - hence my forum post!
revolution 24 Aug 2021, 15:30
sylware wrote: I know; that's why I mentioned the performance counters. But there are obvious things to do or to avoid, for instance aligned vs. unaligned data access.

In my experience you can't compare different code and different systems in any simple way, and there is no shortcut to measuring. Anything else is just guessing. If you don't measure, then how can you know it is the best?
macomics 24 Aug 2021, 18:30
First write something that at least works, and only then think about optimization. Simply assembling a program from individually optimized blocks will not work.
revolution 25 Aug 2021, 02:57
To comment on the "obvious" aligned vs. unaligned question:
there is no universal "option A is the best, option B is the worst" result, because there are use cases where unaligned data/code allows better cache usage and improves throughput. Packing more data into a cache line, even if unaligned, can make a large difference. But you have to try it out to know. Sometimes it is awful, sometimes it is awesome.
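A small C illustration of that trade-off. The exact sizes assume the usual x86-64 ABI, and `#pragma pack` is a GCC/Clang/MSVC extension rather than standard C:

```c
#include <stdint.h>

/* Naturally aligned: the uint64_t forces 8-byte alignment, so 7 bytes
 * of padding follow the tag -- only 4 records fit per 64-byte line. */
struct aligned_rec {
    uint8_t  tag;      /* 1 byte, then 7 bytes of padding */
    uint64_t value;
};                     /* sizeof == 16 on x86-64 */

/* Packed: no padding, so 7 records fit in a cache line, but `value`
 * may now be misaligned -- usually cheap on modern x86 unless a load
 * splits a cache line, which is exactly what has to be measured. */
#pragma pack(push, 1)
struct packed_rec {
    uint8_t  tag;
    uint64_t value;
};                     /* sizeof == 9 */
#pragma pack(pop)
```

Whether the denser layout wins depends on the access pattern, which is why measurement beats the "obvious" rule.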
sylware 26 Aug 2021, 09:10
I got my hands on the AMD optimization manual. Yeah, there is a lot to do and a lot to avoid - a lot to keep in mind.
I got my hands on the Adler performance counter assembly package too. I am currently plugging asmlib memory functions into the heavything stuff... well, in the end I may have to go to the performance counters if I am not satisfied with the speed.
macomics 26 Aug 2021, 14:05
sylware wrote: Got my hands on AMD optimizing manual.
sylware 27 Aug 2021, 16:09
This is obvious. Some speed-critical code paths may need a switch based on the processor microarchitecture and generation. I did not find the Intel optimization manual (why??? did I use Google wrong?). I may be able to write, right from the start, code based on optimization patterns shared across microarchitectures (for instance, both have 64-byte cache lines).
Pushing the optimization harder would be required only if the speed of the program were not satisfying.
revolution 27 Aug 2021, 16:23
There are also two main ways to approach that:
switches activated at compile time, producing different exe files each optimised for a particular system; or switches activated at runtime, giving a single mega-exe deployed to all systems. I personally prefer the mega-exe approach because deploying a single file leaves fewer things to go wrong. And the critical parts of the code that have multiple paths can usually be measured in kB, which wouldn't be a big deal even if it were an extra MB, IMO.
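The runtime-switch variant of the mega-exe can be as simple as choosing an implementation once and calling through a function pointer afterwards. A hedged C sketch - the feature test here is a stub standing in for a real CPUID check (e.g. GCC/Clang's `__builtin_cpu_supports`), and the "fast" path is a placeholder, not real SIMD code:

```c
#include <stdbool.h>

static int sum_generic(const int *v, int n)   /* baseline path */
{
    int s = 0;
    for (int i = 0; i < n; i++) s += v[i];
    return s;
}

static int sum_fast(const int *v, int n)      /* stand-in for a SIMD path */
{
    int s = 0;
    for (int i = 0; i < n; i++) s += v[i];
    return s;
}

static bool cpu_has_fast_path(void) { return false; }   /* stub for CPUID */

static int (*sum_impl)(const int *, int);     /* chosen implementation */

static int sum(const int *v, int n)
{
    if (!sum_impl)   /* one-time runtime selection; later calls are indirect */
        sum_impl = cpu_has_fast_path() ? sum_fast : sum_generic;
    return sum_impl(v, n);
}
```

Both variants live in the one executable; only the pointer assignment differs per machine, which is why the size cost stays small.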
macomics 27 Aug 2021, 16:32
https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization

Do not trust Google; it often hides the truth from you.
bitRAKE 28 Aug 2021, 11:58
bitRAKE wrote: We can talk about the features of each method.

The table method (unless it's in a loop) is guaranteed both a code and a data cache miss, whereas the table-less method incurs just a code cache miss. As a more concrete use case: suppose we have hashed some larger set down to our table granularity, but the hash is not completely perfect. Of course, plain gapped functions could be duplicated or could JMP around the gaps - it just seems like a greater waste of space.
Copyright © 1999-2025, Tomasz Grysztar.