flat assembler
Message board for the users of flat assembler.
Furs & system error: BFF

system error



Joined: 01 Sep 2013
Posts: 667

Furs wrote:
You cannot align the stack with SUB rsp unless you know the incoming stack alignment exactly, and that alignment has to be a multiple of the alignment you want.

So if the incoming stack is 16-byte aligned, you *cannot* align it to 32 bytes with "sub" only, without using AND or MOD. The reasoning is purely mathematical.

You can have rsp be any multiple of 16, but there are TWO multiples of 16 (values) for EACH 32-byte multiple.

For example, rsp can be 0, which is aligned for both 16 and 32. But rsp can be also 16, which is aligned for 16 but not for 32. rsp can be 32, which is aligned for both, again. But rsp can also be 48, which is aligned only for 16 and not 32.

Do you understand this simple concept?

Subtracting only serves as an offset. For example, here's two cases:


Code:
1) if the incoming rsp is 48, you can subtract 16 to get a 32-byte aligned vector! but this FAILS if rsp is already 32-byte aligned, i.e. if rsp is 32 or 64

2) so, you remove the sub rsp -- this works if rsp is 32-byte aligned (32, 64, etc) -- but now it FAILS for rsp 16, 48, etc

You can adjust the sub's constant all you want, it is *impossible* to solve this with just a linear sub operation. Impossible. Mathematically.
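
This can be checked by brute force. The following is a minimal Python sketch, not anything taken from the ABI itself: it models rsp as a plain integer, and the helper name `constants_that_always_align` is made up for illustration. It tries every subtraction constant and keeps only those that leave every 16-byte aligned input 32-byte aligned.

```python
def constants_that_always_align(in_align=16, out_align=32, max_const=256):
    """Return every constant k such that (rsp - k) is out_align-aligned
    for ALL in_align-aligned values of rsp (rsp modelled as an integer)."""
    # Alignment only depends on rsp modulo out_align, so a few periods
    # of out_align cover every possible case.
    inputs = range(0, out_align * 4, in_align)
    return [k for k in range(max_const)
            if all((rsp - k) % out_align == 0 for rsp in inputs)]

# 16-byte aligned input, 32-byte aligned output: the list comes back empty.
print(constants_that_always_align(16, 32))
# 32-byte aligned input: multiples of 32 work, as expected.
print(constants_that_always_align(32, 32)[:3])
```

Under this model no constant survives for the 16-to-32 case, which is the "impossible with a linear sub" claim in numeric form.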

You need either to:

1) sub 16 bytes if rsp is not aligned to 32 or
2) sub 0 bytes if rsp is aligned to 32

Such "if" conditional logic cannot be done with a simple subtraction. You can do it with a branch, but that's stupid when you can just use AND.
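
To make the equivalence concrete, here is a small Python sketch (rsp modelled as an integer; the function names are invented for this example). It compares the "if" version with a model of `and rsp,-32` on 16-byte aligned inputs.

```python
def conditional_sub(rsp):
    """Subtract 16 only when rsp is not already 32-byte aligned."""
    return rsp - 16 if rsp % 32 else rsp

def and_align(rsp):
    """Model of `and rsp,-32`: clear the low five bits."""
    return rsp & -32

# Compare both on every 16-byte aligned value in a small range.
for rsp in range(0, 1024, 16):
    assert conditional_sub(rsp) == and_align(rsp)

print(conditional_sub(48), and_align(48))
print(conditional_sub(64), and_align(64))
```

For 16-byte aligned inputs the two agree everywhere in the tested range, so the AND is doing the conditional subtraction without a branch.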

So why do you need a frame pointer? Well, because it is a conditional operation.

You see it subtracts from rsp 16 extra, or zero. But that's based on a condition (its value at entry to the function) that you won't have access to anymore once the condition is done, unless you store the original rsp's value (or whatever) somewhere.

How can you restore the stack or access parameters on the stack (if any) if that depends on a condition? That condition isn't even available anymore for testing, btw. You lost the original rsp's value -- you cannot test anything, you need to store it somewhere to be able to return from the function (restore the stack).
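
A rough Python model of why the original value has to be kept somewhere (an illustrative sketch; `prologue` and `epilogue` are hypothetical names, and rsp is just an integer here): two different entry values end up at the same aligned rsp, so the aligned value alone cannot tell you how to get back.

```python
def prologue(rsp):
    """Model of `mov rbp,rsp` followed by `and rsp,-32`.
    Returns (saved_rsp, aligned_rsp)."""
    saved = rsp
    return saved, rsp & -32

def epilogue(saved):
    """Model of `mov rsp,rbp`: only the saved copy can restore rsp."""
    return saved

saved_a, aligned_a = prologue(72)   # entry rsp = 8 mod 32
saved_b, aligned_b = prologue(88)   # entry rsp = 24 mod 32

# Both entries collapse to the same aligned value...
print(aligned_a, aligned_b)
# ...so only the saved copies still distinguish them.
print(epilogue(saved_a), epilogue(saved_b))
```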

Most people/compilers use rbp (the frame pointer) for this purpose, but really you can use anything you want. You can even use a global variable (though that is not thread-safe and is a little stupid), or an XMM register, or even the x87 FPU if you want (as long as your function doesn't use it). I don't know if the FPU trick works on Windows, though: you need the full 80 bits of precision to be able to store a 64-bit pointer in the FPU, since we'll need a 64-bit significand, and I think Windows lowers the precision by default, so you'd have to change the FPU Control Word yourself.


Your "alignment chain" works by propagating the alignment through the entire stack (which is a waste of stack space for functions not using vectors btw), but the problem is it has to propagate ONE alignment, and they chose 16-bytes for this. This means it *cannot* align to 32-bytes without using AND or MOD or whatever.

This method is not scalable to the future, it is not extensible.

If they were to choose a 32-byte stack alignment and propagate that, then you wouldn't need to realign the stack for AVX, at the expense of more wasted stack space (even for SSE functions which need only 16-byte alignment). However, what about AVX512 or future vectors?

You see what I'm saying? Such a design is stupid because it is *fixed* and in the future you'll have to "realign" the stack anyway with new vector sizes. That's why the design is shit/flawed. It is flawed at the CORE in concept.

Now, if they were to "propagate" a 64-byte alignment on the stack (what 512-bit AVX512 vectors need), that would be a huge waste for functions not using vectors or AVX512. That's why it is obvious that such a "solution" of propagating a fixed alignment is stupid.

Just let the damn vector functions realign the stack as they see fit.

What sense does it make to cater an ABI to a particular instruction set that you know will get replaced with better vectors?

Any program that uses

1) no vectors
2) better vectors than SSE

Will suffer from this. It is such a stupid design, seriously. But we're stuck with it and I understand I have to use it if programs are compiled for this ABI already. I understand it is way too late to change it now, that is WHY I am/was ranting originally.

Rants are complaints that you don't really expect to solve anything, you know ;) Just things that piss you off to rant about.



The answer is simple: CHAINED ALIGNMENT

I've been telling you this over and over and yet you fail to appreciate the KNOWLEDGE I am trying to share with you. With chained alignment, the familiar pattern of stack addresses can almost be observed and spotted instantly.

Upon entry, the stack is always unaligned, no matter which function you are in. I've given you this clue:


Code:
 funcA:
    sub rsp,8



This exists in ALL functions which are ABI-compliant. But in most of them it is concealed by the shadow space allocation, like SUB RSP,40, which does two different tasks at once: 1) re-aligning the stack and 2) allocating the local shadow space. If you are calling from C, then the first thing that you should do is re-align the stack with at least SUB RSP,8. It's the same thing, the same behavior, the same pattern in Chained Alignment.
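
The arithmetic behind SUB RSP,40 can be sketched in Python (an illustrative model only: rsp is an integer, the call-site values are arbitrary, and `rsp_after_prologue` is a made-up helper). The 40 splits into 32 bytes of shadow space plus the 8 bytes that undo the return-address push.

```python
SHADOW_SPACE = 32      # four 8-byte register home slots (Microsoft x64 ABI)
RETURN_ADDRESS = 8     # pushed by `call`

def rsp_after_prologue(rsp_before_call, sub_amount=SHADOW_SPACE + 8):
    """rsp inside the callee after `call` and `sub rsp,sub_amount`."""
    return rsp_before_call - RETURN_ADDRESS - sub_amount

# Arbitrary 16-byte aligned call sites.
for rsp in (64, 80, 96, 112):
    inner = rsp_after_prologue(rsp)
    # Columns: rsp before call, rsp in callee, residue mod 16, residue mod 32
    print(rsp, inner, inner % 16, inner % 32)
```

In this model the residue mod 16 is always 0, while the residue mod 32 alternates between 16 and 0 depending on the call site.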

You need to look harder.
Post 04 Apr 2017, 17:40
Furs



Joined: 04 Mar 2016
Posts: 837
Sigh. What part of 32 bytes do you not understand? Dude, chained alignment propagates a specific alignment on the stack. For MS ABI, it propagates ONLY 16 bytes. By definition, this cannot work for 32 bytes, period.

If it could, then why align to 16-bytes at all? Just align the stack to 8-bytes and use your "magical" chained alignment to align it to 16-bytes? Same as going from 16 to 32 with your "magic" chained alignment, no? What's the difference?

Here, I will use your sub rsp,8 as an example of what happens for 32-bytes. I will show you TWO perfectly valid examples for rsp as input (16-byte aligned, you know? the propagated alignment)

So we have this simple function:

Code:
funcA:
  sub rsp,8



Let's call it with different 16-byte aligned rsp and see what we get:


Code:
; case 1: rsp = 64 at this point, which is 32-byte aligned
call funcA    ; rsp = 56

funcA:
  sub rsp,8  ; rsp = 48, *FAIL*


Code:
; case 2: rsp = 80 at this point, which is *NOT* 32-byte aligned but it is 16-byte aligned
call funcA    ; rsp = 72

funcA:
  sub rsp,8  ; rsp = 64, *GOOD*



Do you see the *FAIL* or do I need to make it larger? It fails in HALF the inputs, which are ALL 16-byte aligned. Because there's TWO 16-byte alignments for each 32-byte alignment. i.e. if X mod 32 is zero, X + 16 is 16-byte aligned but not 32 byte aligned.

Now, you might be tempted to say: no problem, let's adjust sub rsp to sub rsp, 8+16 aka sub rsp, 24?


Code:
; case 1: rsp = 64 at this point, which is 32-byte aligned
call funcA    ; rsp = 56

funcA:
  sub rsp,24 ; rsp = 32, *GOOD* wow it worked!!!


Code:
; case 2: rsp = 80 at this point, which is *NOT* 32-byte aligned but it is 16-byte aligned
call funcA    ; rsp = 72

funcA:
  sub rsp,24 ; rsp = 48, *FAIL* oops NOW IT FAILS ON SECOND CASE



How much simpler can I make this?

You've failed to use your chained alignment to show one function that works for all 16-byte aligned inputs and still leaves the result 32-byte aligned.

What part of conditional subtraction did you not understand?

If you "fix" the subtraction of rsp to a specific value, you will fail in half the inputs.

The only solution without realigning the stack is to PROPAGATE/CHAIN ALIGN 32-BYTES!!!. Meaning we need a NEW ABI for each vector extension if we are to avoid realignment.

Obviously, this is a massive failure already, since ABIs are supposed to be stable, cause code gets compiled to them already. Plus it would waste even more bytes for functions not using vectors if we were to align everything to 32 bytes. Anyway, MS ABI is already done and it propagates only 16 bytes, thus for anything larger it is mathematically impossible to use a fixed subtraction.


Think about it another way: if you could use your so-called chained alignment to align 32 bytes from 16 bytes, then what makes you think you couldn't use the same chained alignment to align 16 bytes from 8 bytes????!??

The latter case would get rid of all "alignment requirements" in the first place. 8 bytes is natural alignment, since the call pushes the return address, which is 8 bytes.

It makes no sense to propagate 16-bytes and think you can 32-byte align without bitwise AND, while claiming you can't do the same for 8-byte to 16-byte.
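
The 8->16 versus 16->32 comparison can be put into numbers with a short Python sketch (a model under stated assumptions: rsp is an integer, `call` pushes 8 bytes, and both helper names are invented here). A fixed subtraction can only work when rsp at function entry has exactly one possible residue modulo the target alignment.

```python
def entry_residues(caller_align, target_align):
    """Set of possible values of (rsp mod target_align) at function entry,
    if every caller keeps rsp a multiple of caller_align before `call`
    (the call then pushes an 8-byte return address)."""
    return {(rsp - 8) % target_align
            for rsp in range(0, caller_align * target_align, caller_align)}

def fixed_sub_can_align(caller_align, target_align):
    # A single constant works only if the residue at entry is known exactly.
    return len(entry_residues(caller_align, target_align)) == 1

print(sorted(entry_residues(16, 16)), fixed_sub_can_align(16, 16))
print(sorted(entry_residues(16, 32)), fixed_sub_can_align(16, 32))
print(sorted(entry_residues(8, 16)), fixed_sub_can_align(8, 16))
```

In this model 16->32 and 8->16 fail for the same reason: there are two possible residues at entry, so no single constant fits both.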


I wish we had a magical chained alignment as you dream of (no sarcasm), so I wouldn't have to realign the stack and the stupid ABI wouldn't have to require 16-byte alignment for SSE. But the real world doesn't work that way, sadly.
Post 05 Apr 2017, 12:30
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15177
Location: GW170817
If you need 32-byte stack alignment you can use:

Code:
and rsp,-32

But first you'll need to save the previous rsp value somewhere (usually it is saved in rbp).
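
For anyone wondering why -32 works as the mask, here is a small Python illustration (a sketch only; `and_rsp` is a hypothetical helper that models a 64-bit register, and the sample address is arbitrary):

```python
MASK64 = (1 << 64) - 1

def and_rsp(rsp, imm):
    """Model of `and rsp,imm` on a 64-bit register (imm is sign-extended)."""
    return rsp & (imm & MASK64)

# -32 as a 64-bit two's complement value: all ones except the low five bits.
print(hex(-32 & MASK64))

# An arbitrary sample stack address, rounded down to a multiple of 32.
rsp = 0x14FE58
aligned = and_rsp(rsp, -32)
print(hex(aligned), rsp - aligned)
```

The number of bytes dropped depends on the incoming rsp, which is why the old value has to be saved before the AND.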
Post 05 Apr 2017, 12:47
Furs



Joined: 04 Mar 2016
Posts: 837
^ that's what I've been telling him for like 2 pages.

@system error: let me ask you one question so you can (hopefully) understand this concept.

Let's start with one fact: MS ABI mandates 16-byte alignment before a call instruction, correct? So we can say, it is "16 byte aligned".

OK, you claim it is possible to go to 32 byte alignment from 16, without using bitwise AND or MOD or whatever. You claim you can just use subtraction/addition to get there. (even if you don't want to show me how, that's fine)

Question to you: Why can't you use the same magic sauce and go from 8 byte alignment to 16 then? And get rid of the 16-byte alignment requirement in the first place? hmm?

So you must agree the 16-byte requirement is a waste since you can get 16-byte alignment from 8, just like you get 32 from 16, right? If not, what makes 16->32 more special than 8->16 and more "doable without AND"? Really trying to understand your logic on this one.
Post 05 Apr 2017, 13:52
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15177
Location: GW170817
If you align the stack to 32 bytes at the start of your program then you can keep it aligned from that point onwards in a similar fashion to how fastcall achieves 16-byte alignment. You would have to do it manually, or use your own macros to do it, since standard macros won't do it for you.
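
As a rough numeric sketch of that idea in Python (assuming a hypothetical private convention where every caller keeps rsp a multiple of 32 before `call`; the helper name and the call-site range are made up):

```python
def callee_rsp(rsp_before_call, sub_amount):
    """rsp inside the callee after `call` (pushes 8 bytes) and `sub rsp,sub_amount`."""
    return rsp_before_call - 8 - sub_amount

# Hypothetical convention: rsp is a multiple of 32 at every call site.
call_sites = range(1024, 2048, 32)

# 24 undoes the return-address push and lands on a 32-byte boundary;
# 56 does the same while also leaving room for 32 bytes of shadow space.
for sub_amount in (24, 56):
    print(sub_amount, all(callee_rsp(r, sub_amount) % 32 == 0 for r in call_sites))

# The usual `sub rsp,40` would only give 16-byte alignment here.
print(40, callee_rsp(1024, 40) % 32, callee_rsp(1024, 40) % 16)
```

Under that assumption a fixed constant is enough again, because the residue at entry is known exactly. It only holds as long as every caller in the chain follows the same private rule.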
Post 05 Apr 2017, 14:09
Furs



Joined: 04 Mar 2016
Posts: 837
No, that's not possible, unless you use a different ABI, which was the whole point. :?

If you call the AVX function from code place A only then yeah you can theoretically see "where" it gets called and adjust it accordingly. But what if you call it from another place which has a different alignment shift? (still 16-bytes but not 32)?

Yes, you can propagate 32-byte alignment, but like I said that means a different calling convention. Sure, you can do THAT, but the whole point was why MS ABI sucks, not how to use your own calling convention :P

Well ok, it doesn't suck for people using SSE vectors (not scalars) a lot. For everyone else (functions not using vectors, or functions using *any* other vector size), it is a waste. It is "tied" to SSE vectors and that's the problem, in my eyes.
Post 06 Apr 2017, 10:45
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 15177
Location: GW170817
fastcall only guarantees 16-byte alignment, so yes I suppose strictly speaking you would be using a different convention. But it would be backwards compatible with fastcall for calling OS functions. As assembly programmers these are the things we do all the time to make our job easier. It wouldn't be much of a change to the normal fastcall macro to convert it to avxcall with 32-byte alignment. However for most applications it wouldn't be necessary to always keep 32-byte alignment. Only those parts which require such alignment could benefit from it.

Also there is no need to use standard fastcall within your own code. You only need that when calling out to an OS function. So if you want you can use any call format you desire, even having every call be unique, within your own code. It might be a pain to maintain but the CPU won't care.
Post 06 Apr 2017, 10:57
system error



Joined: 01 Sep 2013
Posts: 667
Furs

It's not that SUB RSP,8 that I was specifically talking about. It's about how to derive your own formula/expression to achieve such unconventional, non-compliant, non-framed alignment based on the observable patterns of the addresses. This is purely arithmetic work. You can combine multiple operators to achieve that (shl, shr, mod, and, not, neg, or, etc). Different people use different techniques / combos. That's why, since you claimed to be a master of AVX/AVX2, this problem should be easy for you to solve. It's just integers anyway. So, instead of doing this


Code:
push rbp
mov rbp,rsp
sub rsp,96
and rsp,-32
...



You can instead just do

Code:
sub rsp,*** your trade secret here using multiple operators ***



Go figure.
Post 06 Apr 2017, 19:19
Furs



Joined: 04 Mar 2016
Posts: 837
What? You say we apply the operators on the constant? But we need to apply the AND (or MOD, which is the same thing, only that it is more "mathematical" than bitwise operators) on rsp itself, not on its constant in the sub. :?

This is a mathematical problem (with MOD) and it cannot be simplified, it's like trying to simplify "x*5 + 3" by doing something like idk "x*(5+3)" (folding the constants, which is obviously a completely different operation).

What I mean is that

Code:
(rsp MOD x) - y

cannot be simplified, i.e. the following is wrong (produces different results)

Code:
rsp - (x MOD y)

or something like that... (x and y can be anything btw, you can tweak them to anything and it won't be the same operation)
Post 07 Apr 2017, 11:23
system error



Joined: 01 Sep 2013
Posts: 667

Furs wrote:
What? You say we apply the operators on the constant? But we need to apply the AND (or MOD, which is the same thing, only that it is more "mathematical" than bitwise operators) on rsp itself, not on its constant in the sub. :?

This is a mathematical problem (with MOD) and it cannot be simplified, it's like trying to simplify "x*5 + 3" by doing something like idk "x*(5+3)" (folding the constants, which is obviously a completely different operation).

What I mean is that

Code:
(rsp MOD x) - y

cannot be simplified, i.e. the following is wrong (produces different results)

Code:
rsp - (x MOD y)

or something like that... (x and y can be anything btw, you can tweak them to anything and it won't be the same operation)



You're not even close. But don't give up.
Post 10 Apr 2017, 13:45
Furs



Joined: 04 Mar 2016
Posts: 837
Never seen any other "method" of aligning the stack (or anything in memory, really), so you should probably enlighten us with your secret method, as I don't feel like chasing impossible (to me) solutions.

Of course for the stack it is easier as you only need an AND instruction because it grows down. In memory or when aligning sizes or whatever, usually you align it "upward" not downward so you need an extra ADD instruction before the AND (i.e. the classic (x + align-1) & -align trick)
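
For reference, both directions of the trick written out as a small Python sketch (function names are just for illustration; `align` is assumed to be a power of two):

```python
def align_down(x, align):
    """Round x down to a multiple of align (a power of two).
    This is the stack case: a single AND."""
    return x & -align

def align_up(x, align):
    """Round x up to a multiple of align (a power of two).
    This is the classic (x + align-1) & -align form: an ADD, then the AND."""
    return (x + align - 1) & -align

print(align_down(63, 32), align_up(33, 32))
print(align_down(64, 32), align_up(64, 32))
```

Values that are already aligned come back unchanged in both directions.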
Post 11 Apr 2017, 11:22