flat assembler
Message board for the users of flat assembler.

Index > Main > PIE friendly indexed jump table

Goto page Previous  1, 2
Author
Thread Post new topic Reply to topic
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
bitRAKE
We can talk about the features of each method. The table method (unless it's in a loop) is guaranteed a code and data cache miss. Whereas the table-less method is just a code cache miss. With macros I think both methods can be automated.

One of the simplest byte code machines is realized with table-less method:
Code:
RunCode:
.loop:
        ; setup for next instruction byte code
        call @F
        jmp .loop
@@:
        lodsb
        ; any method works for dispatcher
        call Method_00
align $10000 ; no bytes generated, just informational
Method_00:
        test al,al
        jz .stop
        ; dispatch to byte code instruction
        mov [rsp+1],al
        retn
.stop: ; actual 00 method code
        pop rax ; discard instruction dispatch address
        pop rax ; discard .loop return to exit RunCode
        retn

align $100
Method_01:
        retn
...

align $100
Method_FF:
        retn    

_________________
¯\(°_o)/¯ unlicense.org
Post 23 Aug 2021, 19:34
View user's profile Send private message Visit poster's website Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 81
Location: Marseille/France
sylware
I should favor reasonably "fast" version, namely the code should not become an obvious bottleneck because I am missing something which is critical and obvious for the experienced coders.

I know I will have to deal with the performance counters at some point using realistic datasets.
Post 23 Aug 2021, 23:22
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
bitRAKE
Here is one possibility - just to get things working ...
Code:
struc(name) MakeFunctions
        match =?end?,name
                ; note: last function doesn't need a tail
                restruc MakeFunctions ; only one array of functions
                PreviousFunction reequ name
        else
                if ~ definite FirstFunction
                        FirstFunction := 1

                        FunctionMax = 0
                        postpone
                                match =?end?,PreviousFunction
                                else match any,PreviousFunction
                                        err 10,'"?End MakeFunctions" required to terminate function list (',`any,' was last function)'
                                end match
                                FunctionSize := FunctionMax
                        end postpone

                        macro CallFunction value*
                                imul eax,value,FunctionSize
                                lea rcx,[name] ; RIP relative
                                add rax,rcx
                                call rax
                        end macro
                else
                        .code_bytes := $ - PreviousFunction
                        if FunctionMax < .code_bytes
                                FunctionMax = .code_bytes
                        end if
                        rb FunctionSize - .code_bytes
                end if
                PreviousFunction reequ name
                name:
        end match
end struc    
It seems flexible to me, but I don't know of your requirements.

It can be used like:
Code:
; just being silly creating different sized functions

Grape MakeFunctions
        mov eax,0
        retn

Molly MakeFunctions
        xor qword [rdx*8+rbx+0x1234567],-0x01234568
        push 1
        pop rax
        retn

Jovial MakeFunctions
        retn

Yellow MakeFunctions
        mov ax,3
        retn

Bacon MakeFunctions
        mov al,4
        retn

?End MakeFunctions

; functions can be called directly from internal code:
        call Yellow

; or they can be called dynamically by index:
        YELLOW := 3
        fNum dd YELLOW

        CallFunction [fNum]    
Edit: fixed an error and added some assemble-time error checking (i.e. more difficult to use incorrectly). Doing ".my_label MakeFunctions" will break it - should probably use namespace? It just seems these functions would be substantial and not dot names.

Could add a feature to generate exported function numbers and labels - maybe later.

_________________
¯\(°_o)/¯ unlicense.org
Post 24 Aug 2021, 01:05
View user's profile Send private message Visit poster's website Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 81
Location: Marseille/France
sylware
thx for the help, this seems really overkill! We'll end-up writting a compiler in fasmg macro/calm syntax Smile (no offence intended).

It looks like I have to choose between:
- one data cache line miss with one jump, whole code is in one 16-bytes paragraph.
- one instruction cache miss with two jumps, and the code fits on several 16-bytes paragraphs (It seems it is better to have the jump at the beginning of a 16-bytes paragraph, and as few as possible).

Is there any experience coder who know the less worse answer?
Naively I would favor the first one.
Post 24 Aug 2021, 12:25
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
revolution
sylware wrote:
Is there any experience coder who know the less worse answer?
Naively I would favor the first one.
There is no way to know without testing on each target system the code is destined for.

Different systems/CPUs/components/OS/stuff all behave differently. You'll just have to measure. Sad
Post 24 Aug 2021, 13:50
View user's profile Send private message Visit poster's website Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 81
Location: Marseille/France
sylware
I know, that's why I did mention the performance counters. But, there are obvious things to do or not to do, for instance aligned/unaligned data access
... and experienced coders may know something about more complex patterns like this one, hence my forum post! Smile
Post 24 Aug 2021, 15:20
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
revolution
sylware wrote:
I know, that's why I did mention the performance counters. But, there are obvious things to do or not to do, for instance aligned/unaligned data access
... and experienced coders may know something about more complex patterns like this one, hence my forum post! Smile
It isn't about experience, it is about your code and your systems.

In my experience you can't compare different code and different systems in any simple way. In my experience there is no shortcut to measuring it. Anything else is just guessing. If you don't measure then how can you know it is the best?
Post 24 Aug 2021, 15:30
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 164
Location: Russia
macomics
First, write at least something working, and then you will think about optimization. Just to make a program from optimized blocks will not work.
Post 24 Aug 2021, 18:30
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
revolution
To comment about the "obvious" aligned vs unaligned.

This doesn't give a universal "option A is the best, option B is the worst" result. Because there are use cases where using unaligned data/code can allow better cache usage, improving throughput. Packing more data into a cache line, even if unaligned, can make a large difference. But you have to try it out to know. Sometimes it is awful, sometimes it is awesome.
Post 25 Aug 2021, 02:57
View user's profile Send private message Visit poster's website Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 81
Location: Marseille/France
sylware
Got my hands on AMD optimizing manual. Yeah, there is a lot to do and a lot to avoid. It means a lot to keep in mind.
Got my hands on Adler performance counter assembly package too.
I am currently plugging asmlib memory functions into the heavything stuff... well, in the end I may have to go performance counter if I am not satisfied with the speed.
Post 26 Aug 2021, 09:10
View user's profile Send private message Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 164
Location: Russia
macomics
sylware wrote:
Got my hands on AMD optimizing manual.
For processors from another manufacturer, these tips may be wrong.
sylware wrote:

Got my hands on Adler performance counter assembly package too.
I am currently plugging asmlib memory functions into the heavything stuff... well, in the end I may have to go performance counter if I am not satisfied with the speed.
This will only be true for your computer.
Post 26 Aug 2021, 14:05
View user's profile Send private message Reply with quote
sylware



Joined: 23 Oct 2020
Posts: 81
Location: Marseille/France
sylware
This is obvious. Some speed critical code paths may need some switch based on the processor microarchitecture and generation. Did not find the intel optimizing manual (why??? did I use google wrong?). I may be able to write, right from the start, code based on microarchitecture shared optimization patterns (for instance both have 64 bytes cache lines).
Pushing harder the optimization would be required only if I think the speed or the program was not satisfying.
Post 27 Aug 2021, 16:09
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 18222
Location: In your JS exploiting you and your system
revolution
There are two main ways to approach that also.

Switches that are activated at compile time and then create different exe files, each optimised to a particular system.

Or switches activated at runtime and have a single mega-exe file deployed to all systems.

I personally prefer the mega-exe approach because it is fewer things to go wrong to deploy a single file. And often the critical parts of the code that have the multiple paths can usually be measured in kB. Which isn't a big deal even if it was an extra MB IMO.
Post 27 Aug 2021, 16:23
View user's profile Send private message Visit poster's website Reply with quote
macomics



Joined: 26 Jan 2021
Posts: 164
Location: Russia
macomics
Post 27 Aug 2021, 16:32
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 3307
Location: vpcmipstrm
bitRAKE
bitRAKE wrote:
We can talk about the features of each method. The table method (unless it's in a loop) is guaranteed a code and data cache miss. Whereas the table-less method is just a code cache miss.
Performance isn't the only thing: That's a good point, but aren't there instances where the pointer table is less messy? For example, mapping multiple values to the same function or changing what function a value maps to.

As a more concrete use case: suppose we have hashed some larger set down to our table granularity, but it's not completely perfect. Of course, plain gapped functions could be duplicated or JMP around - just seems like a greater waste of space.

_________________
¯\(°_o)/¯ unlicense.org
Post 28 Aug 2021, 11:58
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.