Message board for the users of flat assembler.
the register dance
sylware 25 Sep 2021, 23:40
While coding my app, instead of keeping most of my variables in memory (heap and stack), I wanted to try the following "policy": fill the registers (general purpose and vector) as much as possible, sparing a few scratch ones, with some smart shuffling based on how their content is used and on the call/jump graph of the code "unit". Ofc, once I ran out of register space I would use a bit of stack space. While filling the registers, I would "roughly" prioritize the callee-saved registers (rbx rbp r12-r15), just in case a future C ABI external call slips in, then fill the argument passing registers in reverse order (r9 r8 r10/rcx rdx rsi rdi, xmm7->xmm0), excluding some scratch registers: I was ok with 3 or 4 general purpose scratch registers (rax rcx r11 r9) and sometimes 4 vector ones (xmm7->xmm4). Leaf code paths would use all the argument passing registers as they see fit.
Ok, my app is memory bound, so putting as many variables as possible in registers is not really worth it: the loads/stores related to those variables are supposed to be negligible compared to the loads/stores of the bulk. But... training!
To do that, I tracked the content of the registers in writing, in the source, all along the call/jump graph.
And you guys? What is your register management policy?
P.S. I wish my laptop was recent enough to handle AVX2 like my workstation...
sts-q 26 Sep 2021, 04:47
Even after years, still an open question!
I like programming in assembler. But there are two things i haven't found a good solution for:
* when to push to the stack
* (nested) if-then-else constructions
What i do is:
have low-level, mid-level and higher-level functions:
low-level uses: a b c d
mid-level uses: si di k v ( source index, destination index, key, value )
higher-level uses: tos sos bp ... and RAM
i do a lot of register renaming:
k equ r8
v equ r8
I think a better solution would be to do more paper-and-pencil programming, that is, first understand what i would like to do, then hack it into the computer.
bitRAKE 26 Sep 2021, 05:57
It's like an ebb-and-flow when I am trying to optimize register usage. Top->down and then bottom->up -- repeat. External APIs are like a boundary condition - they are very strict and much of the state afterward is undefined, so we can work backward from the API call as well as start over once it completes.
I too like to use register renaming as well as memory renaming with VIRTUAL blocks.
Macros can be scaled (i.e. called by other macros) if all the operands are constant or parametric. These types of macros are quite universal, whereas using registers or memory directly in a macro restricts its utility.
Another pattern is to code for both positive and negative logic as a type of control flow optimization. Sometimes the code works out better with the polarity reversed. It's possible to encapsulate this within a macro as well (contrived example):
    macro PARITY? true,false,regmem*,bits
        assert 0 < bits & bits < 9
        test regmem,(1 shl bits) - 1
        match ,true
            jpo false
        else match ,false
            jpe true
        else
            err
        end match
    end macro

    macro ODD? even*,regmem*
        PARITY even,<>,regmem,1
    end macro

    macro EVEN? odd*,regmem*
        PARITY <>,odd,regmem,1
    end macro
sylware 26 Sep 2021, 12:03
@sts-q This is roughly what I wrote down in the code, but for each significant code path (usually following the call/jump graph). I did use the "define" directive for tracking though, like "define node_p r12".
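A minimal sketch of this kind of tracking, assuming fasm 1 syntax, an ELF64 object file and the SysV calling convention. The routine `list_length`, the names `n_nodes` and `tmp`, and the list layout (next pointer at offset 0) are hypothetical, picked only to show the renaming; only `define node_p r12` comes from the post itself.

```asm
format ELF64

; Register map for this code unit (hypothetical, for illustration).
define node_p  r12              ; callee-saved: current node pointer
define n_nodes r13              ; callee-saved: nodes visited so far
define tmp     rax              ; scratch

section '.text' executable
public list_length

; in:  rdi = head of a singly linked list (next pointer at offset 0)
; out: rax = number of nodes
list_length:
        push    node_p          ; callee-saved registers must be preserved
        push    n_nodes
        mov     node_p, rdi
        xor     n_nodes, n_nodes
.next:
        test    node_p, node_p  ; null pointer ends the list
        jz      .done
        mov     node_p, [node_p]
        inc     n_nodes
        jmp     .next
.done:
        mov     tmp, n_nodes    ; return value goes in rax
        pop     n_nodes
        pop     node_p
        ret
```

The point of the names is that moving a variable to another register becomes a one-line change in the map at the top, and the map doubles as the written record of what each register holds.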
@bitRAKE yep, multi-pass seems kind of mandatory, and "external" calls are "expensive" if many registers have to be "saved". What I try to do is have a "first pass" that already roughly optimizes register usage, using this "fill them all" policy.
I noticed too that I tend to avoid "call", which goes to the stack to save the RIP: more and more, I use a "link" register with jumps.
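A rough sketch of the "link register" idea, assuming fasm 1 syntax and an ELF64 object file. The choice of r11 as the link register and the routines `demo` and `add_one` are hypothetical.

```asm
format ELF64

define link r11                 ; hypothetical choice of "link" register

section '.text' executable
public demo

; out: eax = 21
demo:
        mov     edi, 20
        lea     link, [.back]   ; return address (fasm encodes this RIP-relative)
        jmp     add_one         ; "call" without touching the stack
.back:
        mov     eax, edi
        ret

; in/out: edi, returns through the link register
add_one:
        inc     edi
        jmp     link            ; "ret" without touching the stack
```

One trade-off worth measuring: a return done with `jmp reg` is predicted as an ordinary indirect branch, not by the return-address predictor that `call`/`ret` pairs get. Nested calls also need the link register saved somewhere first.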
And I follow the paranoid "nops align everything" I saw in the HeavyThing assembly project.
revolution 26 Sep 2021, 12:22
The "nops to align" can be beneficial, and it can be harmful also.
Be aware of your cache wastage when you start inserting nops. Some code segments might see an advantage, others might see a disadvantage. A lot of it is CPU dependent. On some CPUs it makes no difference, on others it can have an impact, either negative and/or positive and/or opposite from other CPUs.
IME putting nops everywhere is a bad strategy overall. I found it just made code larger and more cluttered, with any "benefits" so tiny as to be unmeasurable. But I guess on some critical code there might be a use case that can see something useful.
sylware 26 Sep 2021, 20:51
@revolution the thing with the "nops everything": I got my hands on the AMD and Intel optimization manuals, and the "16-byte paragraph" seems really critical for both microarchitectures.
I did set up some RDTSC related macros, just in case. I may fool around to torture the alignment of my code to see if it has an obvious impact (on 2 CPUs, zen2 and oooold intel). To be fair, since this code is memory bound, I am not supposed to see anything significant though.
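A sketch of what such a measurement could look like, assuming fasm 1 syntax and an ELF64 object file. The macro `tsc_read` and the routine `time_loop` are hypothetical, not the macros mentioned above. Removing or changing the `align 16` line and comparing the returned counts is the experiment.

```asm
format ELF64

; Hypothetical helper: read the TSC into dest. Clobbers rax and rdx.
macro tsc_read dest {
        lfence                  ; let earlier instructions finish first
        rdtsc                   ; edx:eax = time-stamp counter
        shl     rdx, 32
        or      rax, rdx
        mov     dest, rax
}

; The section itself must be aligned at least as much as the code inside it.
section '.text' executable align 16
public time_loop

; out: rax = TSC ticks spent around the loop
time_loop:
        mov     ecx, 1000000
        tsc_read r8
        align 16                ; fasm pads with NOPs up to a 16-byte boundary
.loop:
        dec     ecx
        jnz     .loop
        tsc_read r9
        mov     rax, r9
        sub     rax, r8
        ret
```

On recent CPUs the TSC ticks at a constant reference rate rather than at the current core clock, so the result is best compared across several runs on the same machine and not read as an exact cycle count.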
Back to how to track register usage: I already posted something about such a software assistant. Now it starts to look like overkill, since writing down the register usage at pertinent points in the code did feel like a better compromise.
sylware 25 Oct 2021, 20:58
I have a pseudo-mechanical way (namely, it varies a lot based on the context) to use the general registers for variables: first the callee-saved regs, with rbp last (it is used for saving rsp around C ABI external calls), then the argument passing registers in reverse order. The basic scratch general registers would be rax, rcx and r11.
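A minimal sketch of the "rbp for rsp saving" part, assuming fasm 1 syntax, an ELF64 object file and the SysV ABI. `ext_func` stands for any external C function and `call_ext` is a hypothetical wrapper.

```asm
format ELF64

section '.text' executable
extrn ext_func                  ; hypothetical external C function
public call_ext

call_ext:
        push    rbp             ; rbp is callee-saved, keep the caller's value
        mov     rbp, rsp        ; remember the stack pointer, whatever it is
        and     rsp, -16        ; SysV ABI: rsp must be 16-byte aligned at the call
        call    ext_func        ; may clobber rax rcx rdx rsi rdi r8-r11
        mov     rsp, rbp        ; undo the alignment adjustment
        pop     rbp
        ret
```

Since rbp survives the external call, the code does not need to know how misaligned rsp was beforehand. Linking against a shared library may need a PLT call instead of the plain `call` shown here.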