Stack Realignment "Techniques"


Furs

Joined: 04 Mar 2016
Posts: 2697

Furs 27 Aug 2017, 22:41

Yeah, I know the commonly used idiom, so I tested GCC and ICC (in 32-bit mode, to rule out any weird x64 ABI "requirements") with a 256-bit vector (32-byte alignment). I also added a volatile int, which increases the local frame size by 4, so the locals total 36 bytes, excluding alignment padding. The generated code, of course, looks like this:

Code:

push ebp
mov ebp, esp
and esp, -32
sub esp, 64  ; because it won't fit in 32-bytes and has to be aligned to 32 for some reason?

Now, my issue isn't even that GCC wants to keep esp aligned to 32 bytes (normally it keeps it 4-byte aligned with the settings I gave it).

No, the problem is: why does the and come before the sub? In fact, this question is broader than GCC: why is the "common idiom" to do the bitwise and before the subtraction?!?

This code:

Code:

push ebp
mov ebp, esp
sub esp, 36  ; since it only actually needs 36 bytes, it's sufficient
and esp, -32 ; and esp is STILL aligned to 32 bytes here

looks much better. In the worst case it uses 64 bytes, but in the best case it uses only 36. The first version uses at least 64 bytes, and that's its best case.
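To put numbers on the two prologues, here is a small Python model of the 32-bit stack pointer (a sketch, not compiler output). It assumes esp is 4-byte aligned right after `mov ebp, esp`, as with the settings described above; the function names are made up for illustration.

```python
ALIGN = 32
MASK = 0xFFFFFFFF  # model a 32-bit esp


def and_then_sub(esp, frame=64):
    """and esp, -32 ; sub esp, 64 (the order the compilers emit)."""
    esp &= -ALIGN & MASK
    return (esp - frame) & MASK


def sub_then_and(esp, frame=36):
    """sub esp, 36 ; and esp, -32 (the swapped order)."""
    esp = (esp - frame) & MASK
    return esp & (-ALIGN & MASK)


def usage_range(prologue):
    """Min and max bytes consumed over every 4-byte-aligned entry esp (mod 32)."""
    base = 0x00400000
    used = [(base + off) - prologue(base + off) for off in range(0, ALIGN, 4)]
    return min(used), max(used)


print(usage_range(and_then_sub))  # (64, 92)
print(usage_range(sub_then_and))  # (36, 64)
```

Under these assumptions the swapped order never uses more stack than the and-first order does in its best case.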

What's the fucking reasoning for everyone using and before sub? I seriously spent like 1 hour looking through GCC sources and I can't find a proper reasoning for this, but maybe I'm too tired.

I'm asking since I don't know if I should try to even suggest this to GCC patches, but knowing some retards there (who also happen to have broken english), it would probably get ignored. I mean if there's a reasoning for and before sub, or whatever. And it's not only GCC.


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 08:08

For those with complete mastery of stack programming, it doesn't really matter (add/sub or sub/add) because they know exactly where in aligned memory area to write/read the SSE/AVX data to/from amongst other local data and volatile registers in the stack frame. I am surprised that this kind of question came from you (the master of everything).

Not to troll you around or anything, but please, read a book and practice. Lots of practice.

I am out of here xD


revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20804
Location: In your JS exploiting you and your system

revolution 28 Aug 2017, 08:17

Furs: It is probably about being unaware and/or not caring. Many people these days treat memory as an infinite resource so "wasting" a few bytes here and there makes no big difference to them. Plus, the attitude of: it works so there is nothing to fix. The only real problem that could occur (aside from simply running out of memory) is cache thrashing causing some performance degradation, and naturally as everyone knows, performance issues are always solved by buying newer hardware.


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 28 Aug 2017, 10:47

Jesus christ system error, stop being such a fucking cunt.


vivik

Joined: 29 Oct 2016
Posts: 671

vivik 28 Aug 2017, 10:51

I hope there is a way to make programs that do not use ebp at all. This means no recursion and no callstack, but it's not that necessary anyway.


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 10:55

@vivik

Which one of my statements is wrong? Prove that you know better than Furs's "special stack alignment technique". I am interested to know xD


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 11:13

vivik wrote:

I hope there is a way to make programs that not use ebp at all. This means no recursion and no callstack, but it's not that necessary anyway.

Your hope for non-EBP? Sheessshhh. It's a normal practice if you know your way around the bare metal stack. For example, 64-bit ABI is an attempt to do just that where programmers try to give a flat impression to the CPU while minimizing the cost of stack activation. It can be done with 32-bit ABI as well. The requirement for EBP-based stack programming only comes from high-level libraries (aka calling conventions), not from the CPU. CPU knows no calling conventions.

I warned you before - do not attempt to learn assembly language from C. It's not healthy and often misleading. You didn't listen. Now you're in trouble.


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 11:29

revolution wrote:

Furs: It is probably about being unaware and/or not caring. Many people these days treat memory as an infinite resource so "wasting" a few bytes here and there makes no big difference to them. Plus, the attitude of: it works so there is nothing to fix. The only real problem that could occur (aside from simply running out of memory) is cache thrashing causing some performance degradation, and naturally as everyone knows, performance issues are always solved by buying newer hardware.

Furs problem is much simpler and more basic than that. HE/SHE DOESN'T UNDERSTAND how stack memory works. His so-called "special technique" comes from his/her own incompetency in understanding simple concepts like little-endian, byte addressable memory addressing.

Simple pseudocode, based on Fur's own statements;

Stack wants:
1) a dword (4)
2) a 32-bytes allocation for a YMM register (32-bytes)

So altogether, it requires a 36 bytes from the stack. Does AND/SUB vs SUB/AND really matter? No!! Only incompetent IDIOTS say it does;

Look at this code... what special technique do you see in addressing an AVX data? Nothing. It requires no special alignment technique. Align it to 32 and you're free to decide where to put / write your AVX data in that particular spot based on your own calculation. That's what assembly programmers do!

Code:

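tester1:
        push    ebp
        mov     ebp,esp         ;EBP keeps the unaligned ESP for the epilogue
        and     esp,-32         ;ESP is now a multiple of 32
        sub     esp,36          ;ESP = 28 (mod 32), so ESP+4 is aligned

        vmovdqa [esp+4],ymm0    ;AVX local. Aligned location is at ESP+4
        mov     eax,[esp]       ;dword local

        mov     esp,ebp         ;undo the realignment
        pop     ebp
        ret


tester2:
        push    ebp
        mov     ebp,esp         ;EBP keeps the unaligned ESP for the epilogue
        sub     esp,36          ;reserve at least 36 bytes
        and     esp,-32         ;TOS is aligned to 32.

        vmovdqa [esp],ymm0      ;AVX local. Aligned location is at TOS.
        mov     eax,[esp+32]    ;dword local

        mov     esp,ebp         ;undo the realignment
        pop     ebp
        ret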

Does this require any explanation? Are we required to allocate 64 bytes stack space just to patch up for the 'missing space'? I don't think so. This is just as clear as the blue sky for those in the know.


Furs

Joined: 04 Mar 2016
Posts: 2697

Furs 28 Aug 2017, 11:35

revolution wrote:

Furs: It is probably about being unaware and/or not caring. Many people these days treat memory as an infinite resource so "wasting" a few bytes here and there makes no big difference to them. Plus, the attitude of: it works so there is nothing to fix. The only real problem that could occur (aside from simply running out of memory) is cache thrashing causing some performance degradation, and naturally as everyone knows, performance issues are always solved by buying newer hardware.

Well that's of course what I also thought myself, but I wanted to ask to see if I was missing anything. I mean remember my rant about compilers never generating better code than hand-written asm, due to the developers not caring, not because of "technical limitations". To them keeping a compiler "simple" is more important than having better code generator for all the code they'll end up compiling it with. With such mentality, it is no wonder they will never beat hand-written asm code.

However this case really pisses me off since, well, the "swap" doesn't even require extra complications. Literally, just swap the and and sub (for GCC e.g. in ix86_expand_prologue hook) and then don't "round up" the stack size to the largest alignment (crtl->stack_alignment_needed used in reload). But I mean even if they still round up the stack (i.e. provide no gain), swapping the two produces no worse results, so it should still be done regardless, IMO.

I'm more annoyed because Clang and ICC do the same thing (or ICC used to, idk these days, can't check godbolt as it doesn't work for some reason), like seriously, why is everyone so incompetent?

Anyway, thanks for "confirming" my suspicions.

system error wrote:

For those with complete mastery of stack programming, it doesn't really matter (add/sub or sub/add) because they know exactly where in aligned memory area to write/read the SSE/AVX data to/from amongst other local data and volatile registers in the stack frame. I am surprised that this kind of question came from you (the master of everything).

I will bite and be civil. First, it's "and" not "add".

It's not an arithmetic operation, it's a bitwise operation, which has the effect of conditionally performing arithmetic on it. No, you don't know where exactly the AVX variable is, because you don't have a 32-byte aligned stack pointer, deal with it. You need to conditionally subtract from it, which is what bitwise and does, and also to save it in ebp to be able to perform the same step "back" when exiting the function.
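A quick way to see the "conditional subtraction" is to compare the and against an explicit modulo. The sketch below is plain Python integer arithmetic, not assembly; `align_down` is an illustrative name, and the sample addresses are arbitrary apart from the two used later in this post.

```python
def align_down(esp, align=32):
    """What `and esp, -align` computes when align is a power of two."""
    return esp & -align


for esp in (0x0040000C, 0x00400010, 0x00400020):
    dropped = esp - align_down(esp)
    # The and subtracts esp % 32: nothing if esp is already aligned,
    # otherwise just enough to reach the next lower multiple of 32.
    assert dropped == esp % 32
    print(hex(esp), "->", hex(align_down(esp)), "dropped", dropped)
```

Because the amount dropped depends on the run-time value of esp, the function cannot undo it with a constant add, which is why the pre-alignment esp is kept in ebp.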

You don't control who calls your function and what stack address it has. Suppose your function gets called with these two stack addresses (before the call) 0x00400010 and then with address 0x00400020.

Last time I challenged you to show me code that "knows exactly where variables are", they were misaligned in one of the two cases. Fixing it for that case resulted in the other case being misaligned. Your reply was just "LOL" in defeat.
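The two-caller argument can be checked by brute force. This is a hedged sketch in Python: it assumes a plain `call` (4-byte return address) followed by `push ebp` and a `sub esp, frame` with no realignment, and the helper names are invented for the example.

```python
ALIGN = 32


def vector_address(esp_before_call, frame, offset):
    """Address of [esp+offset] after: call ; push ebp ; sub esp, frame."""
    esp = esp_before_call - 4  # return address pushed by call
    esp -= 4                   # push ebp
    esp -= frame               # sub esp, frame (no realignment)
    return esp + offset


def works_for_both(frame, offset):
    callers = (0x00400010, 0x00400020)
    return all(vector_address(c, frame, offset) % ALIGN == 0 for c in callers)


# Try every 4-byte-granular frame size and slot offset up to 128 bytes.
hits = [(f, o) for f in range(0, 129, 4) for o in range(0, f + 1, 4)
        if works_for_both(f, o)]
print(hits)  # []
```

The two entry addresses differ by 16, so any fixed offset that is 32-byte aligned for one caller lands 16 bytes off for the other.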

So kindly, stop polluting a sensible thread when you have no idea what you're talking about.


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 11:42

And what the fcuk is SUB ESP,64 is for? Hahaha xD


Furs

Joined: 04 Mar 2016
Posts: 2697

Furs 28 Aug 2017, 11:45

Local variables? It's mentioned in the first post. I have a 256-bit AVX vector on the stack and a 4-byte int. 32+4 = 36, and to keep esp aligned to 32 bytes after the sub, the nearest multiple is 64. Aligning esp after the subtraction, though, requires only a sub 36, because esp gets aligned afterwards anyway: if the sub 36 leaves it misaligned, the bitwise and aligns it. That guarantees at least 36 bytes of space and an aligned [esp], so the vector can live there.
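Both halves of that argument fit in a few lines of Python. This is only a model: `round_up` and `layout` are illustrative names, the entry addresses are arbitrary 4-byte-aligned values, and the slot assignment (vector at [esp], int at [esp+32]) is the one described above.

```python
def round_up(size, align):
    """Smallest multiple of align that is >= size (align a power of two)."""
    return (size + align - 1) & -align


locals_size = 32 + 4  # one 256-bit vector plus one 4-byte int
print(round_up(locals_size, 32))  # 64


def layout(entry_esp, size=36, align=32):
    """sub esp, 36 ; and esp, -32, then place the vector at [esp]."""
    esp = (entry_esp - size) & -align
    return {"esp": esp, "vector": esp, "int": esp + 32}


for entry in range(0x00400000, 0x00400020, 4):
    slots = layout(entry)
    assert slots["vector"] % 32 == 0   # [esp] is 32-byte aligned
    assert slots["int"] + 4 <= entry   # both locals fit below the saved ebp
```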

Last edited by Furs on 28 Aug 2017, 11:46; edited 1 time in total


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 11:45

Furs wrote:

revolution wrote:
Furs: It is probably about being unaware and/or not caring. Many people these days treat memory as an infinite resource so "wasting" a few bytes here and there makes no big difference to them. Plus, the attitude of: it works so there is nothing to fix. The only real problem that could occur (aside from simply running out of memory) is cache thrashing causing some performance degradation, and naturally as everyone knows, performance issues are always solved by buying newer hardware.
Well that's of course what I also thought myself, but I wanted to ask to see if I was missing anything. I mean remember my rant about compilers never generating better code than hand-written asm, due to the developers not caring, not because of "technical limitations". To them keeping a compiler "simple" is more important than having better code generator for all the code they'll end up compiling it with. With such mentality, it is no wonder they will never beat hand-written asm code.

However this case really pisses me off since, well, the "swap" doesn't even require extra complications. Literally, just swap the and and sub (for GCC e.g. in ix86_expand_prologue hook) and then don't "round up" the stack size to the largest alignment (crtl->stack_alignment_needed used in reload). But I mean even if they still round up the stack (i.e. provide no gain), swapping the two produces no worse results, so it should still be done regardless, IMO.

I'm more annoyed because Clang and ICC do the same thing (or ICC used to, idk these days, can't check godbolt as it doesn't work for some reason), like seriously, why is everyone so incompetent?

Anyway, thanks for "confirming" my suspicions

system error wrote:
For those with complete mastery of stack programming, it doesn't really matter (add/sub or sub/add) because they know exactly where in aligned memory area to write/read the SSE/AVX data to/from amongst other local data and volatile registers in the stack frame. I am surprised that this kind of question came from you (the master of everything).
I will bite and be civil. First, it's "and" not "add".

It's not an arithmetic operation, it's a bitwise operation, which has the effect of conditionally performing arithmetic on it. No, you don't know where exactly the AVX variable is, because you don't have a 32-byte aligned stack pointer, deal with it. You need to conditionally subtract from it, which is what bitwise and does, and also to save it in ebp to be able to perform the same step "back" when exiting the function.

You don't control who calls your function and what stack address it has. Suppose your function gets called with these two stack addresses (before the call) 0x00400010 and then with address 0x00400020.

Last time I challenged you to show me code that "knows exactly where variables are", they were misaligned in one of the two cases. Fixing it for the other case resulting in the other case being misaligned. Your reply was just "LOL" in defeat.

So kindly, stop polluting a sensible thread when you have no idea what you're talking about.

The code right above you shows how INCOMPETENT you are. Which part of it that you do not understand? Wanna a free schooling today? xD


Furs

Joined: 04 Mar 2016
Posts: 2697

Furs 28 Aug 2017, 11:46

The part where you use 2 functions when there should only be 1 working in both cases, moron. ONE function.


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 11:48

Furs wrote:

Local variables? It's mentioned in the first post. I have a 256-bit AVX vector on the stack and an 4-byte int. 32+4 = 36, and to keep esp aligned (after the sub) to 32-byte the nearest is 64. Aligning esp after the subtraction, though, requires only a sub 36, because it gets aligned after anyway. If the sub 36 makes it misaligned, the bitwise and will align it. It guarantees there's at least 36 bytes of space, so [esp] will be aligned (and use the vector there).

It's all there in my pseudocode

In both cases, the AVX location is well aligned to 32. (32)
It also includes a space for a dword.(4)

So simple math:: 32+4 is 36.

WTF is wrong with your 5th grader math? xD


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 11:50

Furs wrote:

The part where you use 2 functions when there should only be 1 working in both cases, moron. ONE function.

No dumbfcuk. It shows you that it makes no differences in both cases. It's a proof-of-concept pseudocode for comparison!!

Furs, furs furs, apparantly, there's no remedy to you INCOMPETENCY!!

Go read a book and practice!!


Furs

Joined: 04 Mar 2016
Posts: 2697

Furs 28 Aug 2017, 11:54

Uh, yeah it is, but [esp] is not aligned to 32-bytes, which seems to be important for compilers. I don't care what you think, the sub esp, 64 is generated by the COMPILER because it wants to keep it aligned to 32 for whatever reason.

On the other hand, your solution has virtually NO ADVANTAGE compared to mine. Mine has [esp] aligned to 32, yours [esp+4] (first one) using the same amount of space -- this is called alignment offset or bias. Yours is biased by +4.

What the fuck are you even trying to argue about? Something obviously inferior and harder to get right by the compiler? Just to disagree with me? Just admit it, your code is in no way, shape or form, better, and the compiler doesn't generate it either, because it wants to keep [esp] aligned for whatever reason, which mine does.

Also, stop making double posts.


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 12:00

Furs wrote:

Uh, yeah it is, but [esp] is not aligned to 32-bytes, which seems to be important for compilers. I don't care what you think, the sub esp, 64 is generated by the COMPILER because it wants to keep it aligned to 32 for whatever reason.

On the other hand, your solution has virtually NO ADVANTAGE compared to mine. Mine has [esp] aligned to 32, yours [esp+4] (first one) using the same amount of space.

What the fuck are you even trying to argue about? Something obviously inferior and harder to get right by the compiler? Just to disagree with me?

Also, stop making double posts.

OMG, I can't believe this dumbfcuk, master-of-everything.

AVX data is required to be 32-byte aligned. It doesn't matter where the alignment lies in the stack frame.

1) In tester1, the aligned location is at ESP+4
2) In tester2, the aligned location is at ESP (top of stack).

So it doesn't matter whether it is AND/SUB or SUB/AND as long as you know where the aligned area is. It's your technical KNOWLEDGE that counts!

I can teach a Gorilla to fly in less than a page. But this IDIOT right here probably needs two full threads! HAHAHAHA XD


Furs

Joined: 04 Mar 2016
Posts: 2697

Furs 28 Aug 2017, 12:03

Man, you can't understand simple english? I didn't say your code doesn't have the vector aligned. I said it doesn't have [esp] aligned, which is what the COMPILER wants to do. Your solution does not satisfy the compiler (hint: it's the purpose of the thread) and it also requires more work to change the compiler (it needs to keep alignment bias somewhere).

Now, your code also has zero advantage compared to mine, so tell me again, what exactly are you arguing about?

"Look, I can make a code inferior to Furs' just to feel special"?

My vector is at [esp], yours is at [esp+4]. The compiler wants [esp] or [esp+32] or whatever -- do you understand this simple fact?

More importantly how is yours in any way better than mine? Just shut up already.


system error

Joined: 01 Sep 2013
Posts: 670

system error 28 Aug 2017, 12:13

Furs wrote:

Man, you can't understand simple english? I didn't say your code doesn't have the vector aligned.

No you LIED. You said earlier that my code is not vector-aligned. Why the sudden change of mind? Did you bitchslap you that hard? hahaha xD

Quote:

I said it doesn't have [esp] aligned, which is what the COMPILER wants to do. Your solution does not satisfy the compiler (hint: it's the purpose of the thread) and it also requires more work to change the compiler (it needs to keep alignment bias somewhere).

32-bit compilers don't use ALIGNED AVX! You lied again dumbfcuk? xD

Quote:

Now, your code also has zero advantage compared to mine, so tell me again, what exactly are you arguing about?

I wasn't showing any code. I showed PSEUDOCODE, as a proof of concept.

Still, you haven't answered me: why SUB ESP,64. Why not just SUB ESP,36 like I did?

Incompetent? XD


Furs

Joined: 04 Mar 2016
Posts: 2697

Furs 28 Aug 2017, 12:52

You realize sub esp, 64 is GCC right? So why the fuck are you asking me? Go ask GCC on their bugzilla or look through GCC sources.

The answer is: they round the frame size up to a multiple of crtl->stack_alignment_needed -- that variable holds the largest alignment required by any variable on the stack. This is done in the reload pass.

Alright, looks like you don't get it. Here's my last post on your topic. First of all this entire thread is about compiler's output.

So here are facts:

1) GCC generates the "sub esp, 64" to keep esp aligned to 32 bytes. That is GCC, not "Furs", read it until it sinks in.
2) I made this thread because I wanted to know if there is a reason for compilers to use and before sub when it can be the other way with almost no modifications at all, and is superior.
3) You posted some totally irrelevant code that has nothing to do with any compiler and... I don't know, to be honest, what your point even is?

How is your babbling on topic?

Now, here's what my plans were:

1) I asked because I want to send a patch request to GCC to swap and/sub for stack realignment prolog.
2) My patch requires about 5 lines of code change to do this. (see ix86_expand_prologue function in config/i386/i386.c for yourself)
3) Your patch requires having GCC hold an "alignment bias" offset alongside "stack_alignment_needed" in crtl -- which is extra complication for no gain whatsoever compared to mine (not GCC's). I guess to make yourself feel special.
4) Both of our methods require the compiler to stop rounding up the frame size. (i.e. GCC does 36 -> 64 because a variable has 32-byte alignment requirement). However my method does not suffer even if (4) is not done.

So while technically (4) is nice, swapping and/sub produces nothing wrong (in terms of anything) even if (4) is not done (which is more work).

Once again, all what you said about me is in fact GCC generating it, so go cry to GCC bugzilla.

On the other hand, your method requires far more changes to GCC than mine for literally no gain whatsoever -- not in code size, not in performance, not in uops, not in number of insns, not in anything.

So please, let the people who know what they're talking about speak and go do your own... thing. I'm done here.

You're clueless as fuck, fucking apply a few patches to GCC first and see for yourself how the world works, not on your pseudo code. Maybe you'll grow up that way. And yes, this thread is exactly about compiler output, only about compiler output.
