flat assembler
Message board for the users of flat assembler.
Stack Realignment "Techniques"
system error 28 Aug 2017, 08:08
For those with complete mastery of stack programming, it doesn't really matter (add/sub or sub/add) because they know exactly where in aligned memory area to write/read the SSE/AVX data to/from amongst other local data and volatile registers in the stack frame. I am surprised that this kind of question came from you (the master of everything).
Not to troll you around or anything, but please, read a book and practice. Lots of practice. I am out of here xD
revolution 28 Aug 2017, 08:17
Furs: It is probably about being unaware and/or not caring. Many people these days treat memory as an infinite resource so "wasting" a few bytes here and there makes no big difference to them. Plus, the attitude of: it works so there is nothing to fix. The only real problem that could occur (aside from simply running out of memory) is cache thrashing causing some performance degradation, and naturally as everyone knows, performance issues are always solved by buying newer hardware.
vivik 28 Aug 2017, 10:47
Jesus christ system error, stop being such a fucking cunt.
vivik 28 Aug 2017, 10:51
I hope there is a way to make programs that do not use ebp at all. This means no recursion and no call stack, but it's not that necessary anyway.
system error 28 Aug 2017, 10:55
@vivik
Which one of my statements is wrong? Prove that you know better than Furs's "special stack alignment technique". I am interested to know xD
system error 28 Aug 2017, 11:13
vivik wrote:
I hope there is a way to make programs that do not use ebp at all. This means no recursion and no call stack, but it's not that necessary anyway.

Your hope for non-EBP? Sheessshhh. It's normal practice if you know your way around the bare-metal stack. For example, the 64-bit ABI is an attempt to do just that, where programmers try to give a flat impression to the CPU while minimizing the cost of stack activation. It can be done with the 32-bit ABI as well. The requirement for EBP-based stack programming only comes from high-level libraries (aka calling conventions), not from the CPU. The CPU knows no calling conventions.

I warned you before: do not attempt to learn assembly language from C. It's not healthy and is often misleading. You didn't listen. Now you're in trouble.
system error 28 Aug 2017, 11:29
revolution wrote:
Furs: It is probably about being unaware and/or not caring. Many people these days treat memory as an infinite resource so "wasting" a few bytes here and there makes no big difference to them. Plus, the attitude of: it works so there is nothing to fix. The only real problem that could occur (aside from simply running out of memory) is cache thrashing causing some performance degradation, and naturally as everyone knows, performance issues are always solved by buying newer hardware.

Furs's problem is much simpler and more basic than that. HE/SHE DOESN'T UNDERSTAND how stack memory works. His/her so-called "special technique" comes from his/her own incompetency in understanding simple concepts like little-endian, byte-addressable memory.

Simple pseudocode, based on Furs's own statements. The stack needs:
1) a dword (4 bytes)
2) a 32-byte allocation for a YMM register (32 bytes)
So altogether, it requires 36 bytes from the stack.

Does AND/SUB vs SUB/AND really matter? No!! Only incompetent IDIOTS say it does. Look at this code... what special technique do you see in addressing AVX data? Nothing. It requires no special alignment technique. Align it to 32 and you're free to decide where to put / write your AVX data in that particular spot based on your own calculation. That's what assembly programmers do!

Code:
tester1:
        push    ebp
        mov     ebp,esp
        and     esp,-32         ;align ESP down to a 32-byte boundary
        sub     esp,36          ;reserve 36 bytes; ESP is now 4 below a boundary
        vmovdqa [esp+4],ymm0    ;AVX local. Aligned location is at ESP+4
        mov     eax,[esp]       ;dword local
        mov     esp,ebp         ;restore the caller's ESP
        pop     ebp
        ret

tester2:
        push    ebp
        mov     ebp,esp
        sub     esp,36          ;reserve at least 36 bytes
        and     esp,-32         ;TOS is aligned to 32.
        vmovdqa [esp],ymm0      ;AVX local. Aligned location is at TOS.
        mov     eax,[esp+32]    ;dword local
        mov     esp,ebp         ;restore the caller's ESP
        pop     ebp
        ret

Does this require any explanation? Are we required to allocate 64 bytes of stack space just to patch up for the 'missing space'? I don't think so. This is just as clear as the blue sky for those in the know.
Furs 28 Aug 2017, 11:35
revolution wrote:
Furs: It is probably about being unaware and/or not caring. Many people these days treat memory as an infinite resource so "wasting" a few bytes here and there makes no big difference to them. Plus, the attitude of: it works so there is nothing to fix. The only real problem that could occur (aside from simply running out of memory) is cache thrashing causing some performance degradation, and naturally as everyone knows, performance issues are always solved by buying newer hardware.

However, this case really pisses me off since, well, the "swap" doesn't even require extra complications. Literally, just swap the and and sub (for GCC, e.g. in the ix86_expand_prologue hook) and then don't "round up" the stack size to the largest alignment (crtl->stack_alignment_needed, used in reload). But I mean, even if they still round up the stack (i.e. provide no gain), swapping the two produces no worse results, so it should still be done regardless, IMO.

I'm more annoyed because Clang and ICC do the same thing (or ICC used to, idk these days, can't check godbolt as it doesn't work for some reason). Like seriously, why is everyone so incompetent? Anyway, thanks for "confirming" my suspicions.

system error wrote:
For those with complete mastery of stack programming, it doesn't really matter (add/sub or sub/add) because they know exactly where in aligned memory area to write/read the SSE/AVX data to/from amongst other local data and volatile registers in the stack frame. I am surprised that this kind of question came from you (the master of everything).

It's not an arithmetic operation, it's a bitwise operation, which has the effect of conditionally performing arithmetic on it. No, you don't know where exactly the AVX variable is, because you don't have a 32-byte aligned stack pointer, deal with it. You need to conditionally subtract from it, which is what bitwise and does, and also to save the original esp in ebp to be able to perform the same step "back" when exiting the function.

You don't control who calls your function or what stack address it has. Suppose your function gets called with these two stack addresses (before the call): 0x00400010, and then 0x00400020. Last time I challenged you to show me code that "knows exactly where variables are", they were misaligned in one of the two cases. Fixing it for one case resulted in the other case being misaligned. Your reply was just "LOL" in defeat. So kindly stop polluting a sensible thread when you have no idea what you're talking about.
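The "conditional subtraction" described in this post can be illustrated numerically. This is a rough Python sketch, not code from the thread: the helper names are made up, and it assumes a plain call followed by push ebp, so ESP drops by 8 before the alignment step. It shows that and esp,-32 removes a different amount for each of the two caller addresses mentioned above, which is why no fixed sub alone can align both.

```python
def and_align(esp, align=32):
    """Effect of `and esp,-align` for a power-of-two alignment."""
    return esp & -align


def conditional_sub(esp, align=32):
    """Same result written as a subtraction of the misaligned remainder."""
    return esp - (esp % align)


def prologue_adjustment(esp_before_call, align=32):
    """Bytes removed by the AND after `call` and `push ebp` (assumed prologue)."""
    esp = esp_before_call - 4      # call pushes the return address
    esp -= 4                       # push ebp
    return esp - and_align(esp, align)


if __name__ == "__main__":
    for caller_esp in (0x00400010, 0x00400020):
        print(hex(caller_esp), "->", prologue_adjustment(caller_esp), "bytes dropped")
```

With these two addresses the AND drops 8 bytes in one case and 24 in the other, so the adjustment depends on the caller and has to be computed at run time.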
system error 28 Aug 2017, 11:42
And what the fcuk is SUB ESP,64 for? Hahaha xD
Furs 28 Aug 2017, 11:45
Local variables? It's mentioned in the first post. I have a 256-bit AVX vector on the stack and a 4-byte int. 32+4 = 36, and to keep esp aligned to 32 bytes (after the sub), the nearest size is 64. Aligning esp after the subtraction, though, requires only a sub 36, because it gets aligned afterwards anyway. If the sub 36 makes it misaligned, the bitwise and will align it. It guarantees there's at least 36 bytes of space, so [esp] will be aligned (and the vector can go there).
Last edited by Furs on 28 Aug 2017, 11:46; edited 1 time in total
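The numbers in this post (36 rounded up to 64, versus sub 36 followed by the and) can be worked through in a few lines. The Python below is an illustrative sketch only: the function names are invented here, and EBP values are assumed to be 4-byte aligned. It computes how many bytes below EBP each ordering consumes across every residue modulo 32.

```python
def round_up(size, align):
    """Round size up to the next multiple of a power-of-two alignment."""
    return (size + align - 1) & -align


def consumed_and_then_sub(ebp, frame, align=32):
    """Bytes used below EBP by `and esp,-align` followed by `sub esp,frame`."""
    esp = (ebp & -align) - frame
    return ebp - esp


def consumed_sub_then_and(ebp, frame, align=32):
    """Bytes used below EBP by `sub esp,frame` followed by `and esp,-align`."""
    esp = (ebp - frame) & -align
    return ebp - esp


if __name__ == "__main__":
    ebps = range(0x00400000, 0x00400020, 4)   # hypothetical EBP values
    rounded = round_up(36, 32)
    old = [consumed_and_then_sub(e, rounded) for e in ebps]
    new = [consumed_sub_then_and(e, 36) for e in ebps]
    print("and, then sub", rounded, ":", min(old), "to", max(old), "bytes")
    print("sub 36, then and :", min(new), "to", max(new), "bytes")
```

In this model the rounded and-then-sub sequence uses between 64 and 92 bytes, while sub 36 followed by the and uses between 36 and 64, and both leave ESP on a 32-byte boundary.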
system error 28 Aug 2017, 11:45
Furs wrote:
The code right above you shows how INCOMPETENT you are. Which part of it do you not understand? Want a free schooling today? xD
Furs 28 Aug 2017, 11:46
The part where you use 2 functions when there should only be 1 working in both cases, moron. ONE function.
system error 28 Aug 2017, 11:48
Furs wrote:
Local variables? It's mentioned in the first post. I have a 256-bit AVX vector on the stack and a 4-byte int. 32+4 = 36, and to keep esp aligned to 32 bytes (after the sub), the nearest size is 64. Aligning esp after the subtraction, though, requires only a sub 36, because it gets aligned afterwards anyway. If the sub 36 makes it misaligned, the bitwise and will align it. It guarantees there's at least 36 bytes of space, so [esp] will be aligned (and the vector can go there).

It's all there in my pseudocode. In both cases, the AVX location is well aligned to 32 (32 bytes). It also includes space for a dword (4 bytes). So simple math: 32+4 is 36. WTF is wrong with your 5th grader math? xD
system error 28 Aug 2017, 11:50
Furs wrote:
The part where you use 2 functions when there should only be 1 working in both cases, moron. ONE function.

No, dumbfcuk. It shows you that it makes no difference in either case. It's proof-of-concept pseudocode for comparison!! Furs, Furs, Furs, apparently there's no remedy for your INCOMPETENCE!! Go read a book and practice!!
Furs 28 Aug 2017, 11:54
Uh, yeah it is, but [esp] is not aligned to 32 bytes, which seems to be important for compilers. I don't care what you think, the sub esp, 64 is generated by the COMPILER because it wants to keep esp aligned to 32 for whatever reason.
On the other hand, your solution has virtually NO ADVANTAGE compared to mine. Mine has [esp] aligned to 32, yours has [esp+4] aligned (first one), using the same amount of space -- this is called alignment offset or bias. Yours is biased by +4. What the fuck are you even trying to argue about? Something obviously inferior and harder for the compiler to get right? Just to disagree with me? Just admit it, your code is in no way, shape or form better, and the compiler doesn't generate it either, because it wants to keep [esp] aligned for whatever reason, which mine does. Also, stop making double posts.
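The "bias" mentioned in this post can be made concrete. In the sketch below (Python, with names invented for illustration and EBP assumed 4-byte aligned), the bias is taken to mean the smallest non-negative offset from ESP that lands on a 32-byte boundary. That definition is an assumption chosen to match the "+4" figure quoted above, not a term defined in the thread.

```python
def esp_and_then_sub(ebp, frame=36, align=32):
    """ESP after `and esp,-align` / `sub esp,frame` (the tester1 ordering)."""
    return (ebp & -align) - frame


def esp_sub_then_and(ebp, frame=36, align=32):
    """ESP after `sub esp,frame` / `and esp,-align` (the tester2 ordering)."""
    return (ebp - frame) & -align


def alignment_bias(esp, align=32):
    """Smallest offset k >= 0 such that esp + k is a multiple of align."""
    return (-esp) % align


if __name__ == "__main__":
    for ebp in range(0x00400000, 0x00400020, 4):   # hypothetical EBP values
        print(hex(ebp),
              alignment_bias(esp_and_then_sub(ebp)),
              alignment_bias(esp_sub_then_and(ebp)))
```

For a 36-byte frame the and-then-sub ordering always leaves a bias of 4, independent of the caller, while sub-then-and leaves a bias of 0. A compiler using the first ordering would have to carry that constant around when computing aligned slot offsets.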
system error 28 Aug 2017, 12:00
Furs wrote:
Uh, yeah it is, but [esp] is not aligned to 32 bytes, which seems to be important for compilers. I don't care what you think, the sub esp, 64 is generated by the COMPILER because it wants to keep esp aligned to 32 for whatever reason.

OMG, I can't believe this dumbfcuk, master-of-everything. AVX data is required to be 32-byte aligned. It doesn't matter where the alignment lies in the stack frame.
1) In tester1, the aligned location is at ESP+4.
2) In tester2, the aligned location is at ESP (top of stack).
So it doesn't matter whether it is AND/SUB or SUB/AND as long as you know where the aligned area is. It's your technical KNOWLEDGE that counts! I can teach a gorilla to fly in less than a page. But this IDIOT right here probably needs two full threads! HAHAHAHA XD
Furs 28 Aug 2017, 12:03
Man, you can't understand simple English? I didn't say your code doesn't have the vector aligned. I said it doesn't have [esp] aligned, which is what the COMPILER wants to do. Your solution does not satisfy the compiler (hint: it's the purpose of the thread) and it also requires more work to change the compiler (it needs to keep the alignment bias somewhere).
Now, your code also has zero advantage compared to mine, so tell me again, what exactly are you arguing about? "Look, I can make code inferior to Furs' just to feel special"? My vector is at [esp], yours is at [esp+4]. The compiler wants [esp] or [esp+32] or whatever -- do you understand this simple fact? More importantly, how is yours in any way better than mine? Just shut up already.
system error 28 Aug 2017, 12:13
Furs wrote:
Man, you can't understand simple English? I didn't say your code doesn't have the vector aligned.

No, you LIED. You said earlier that my code is not vector-aligned. Why the sudden change of mind? Did I bitchslap you that hard? hahaha xD

Quote:
I said it doesn't have [esp] aligned, which is what the COMPILER wants to do. Your solution does not satisfy the compiler (hint: it's the purpose of the thread) and it also requires more work to change the compiler (it needs to keep the alignment bias somewhere).

32-bit compilers don't use ALIGNED AVX! You lied again, dumbfcuk? xD

Quote:
Now, your code also has zero advantage compared to mine, so tell me again, what exactly are you arguing about?

I wasn't showing any code. I showed PSEUDOCODE, as a proof of concept. Still, you haven't answered me: why SUB ESP,64? Why not just SUB ESP,36 like I did? Incompetent? XD
Furs 28 Aug 2017, 12:52
You realize sub esp, 64 is GCC, right? So why the fuck are you asking me? Go ask GCC on their bugzilla or look through the GCC sources.

The answer is: they round up the frame size to crtl->stack_alignment_needed, which is the largest alignment required by any variable on the stack. This is done in the reload pass.

Alright, looks like you don't get it. Here's my last post on your topic. First of all, this entire thread is about compiler output. So here are the facts:
1) GCC generates the "sub esp, 64" to keep esp aligned to 32 bytes. That is GCC, not "Furs"; read it until it sinks in.
2) I made this thread because I wanted to know if there is a reason for compilers to use and before sub when it can be the other way around with almost no modifications at all, and is superior.
3) You posted some totally irrelevant code that has nothing to do with any compiler and... I don't know, to be honest, what your point even is. How is your babbling on topic?

Now, here's what my plans were:
1) I asked because I want to send a patch request to GCC to swap and/sub for the stack realignment prologue.
2) My patch requires about 5 lines of code change to do this (see the ix86_expand_prologue function in config/i386/i386.c for yourself).
3) Your patch requires having GCC hold an "alignment bias" offset to the "stack_alignment_needed" in crtl -- which is an extra complication for no gain whatsoever compared to mine (not GCC's). I guess to make yourself feel special.
4) Both of our methods require the compiler to stop rounding up the frame size (i.e. GCC does 36 -> 64 because a variable has a 32-byte alignment requirement).

However, my method does not suffer even if (4) is not done. So while technically (4) is nice, swapping and/sub produces nothing wrong (in terms of anything) even if (4) is not done (which is more work). Once again, everything you said about me is in fact GCC generating it, so go cry to GCC bugzilla. On the other hand, your method requires far more changes to GCC than mine for literally no gain whatsoever -- not in code size, not in performance, not in uops, not in number of insns, not in anything.

So please, let the people who know what they're talking about speak and go do your own... thing. I'm done here. You're clueless as fuck, fucking apply a few patches to GCC first and see for yourself how the world works, not on your pseudocode. Maybe you'll grow up that way. And yes, this thread is exactly about compiler output, only about compiler output.
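The claim that swapping and/sub "produces nothing wrong" even when the frame is still rounded up can be checked arithmetically. The following Python sketch is an illustration under stated assumptions (invented function names, 4-byte-aligned EBP values, a 32-byte alignment). It is not GCC code and says nothing about how the patch itself would be written.

```python
def and_then_sub(ebp, frame, align=32):
    """ESP after `and esp,-align` followed by `sub esp,frame`."""
    return (ebp & -align) - frame


def sub_then_and(ebp, frame, align=32):
    """ESP after `sub esp,frame` followed by `and esp,-align`."""
    return (ebp - frame) & -align


if __name__ == "__main__":
    ebps = range(0x00400000, 0x00400020, 4)   # hypothetical EBP values
    # Frame still rounded up to 64: both orders give the same ESP.
    same = all(and_then_sub(e, 64) == sub_then_and(e, 64) for e in ebps)
    # Frame left at 36: only sub-then-and keeps ESP on a 32-byte boundary.
    aligned = all(sub_then_and(e, 36) % 32 == 0 for e in ebps)
    print("identical with frame 64:", same)
    print("ESP aligned with frame 36 (sub, then and):", aligned)
```

When the frame size is already a multiple of the alignment, subtracting it commutes with the mask, so the swapped order cannot be worse in this model. The saving only appears once the rounding is dropped as well.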
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.