flat assembler
Message board for the users of flat assembler.

Index > Windows > Handling potential stack overflow without probing

Author
Thread Post new topic Reply to topic
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 17:18
I'm looking for the most elegant (in terms of assembly, so optimization hacks etc) way to allocate large stuff on the stack (larger than a page, not too large though, let's say 64k bytes for example) on Windows. (Wine too but w/e)

As you know, Windows keeps a guard page at the bottom of the currently committed stack, and accessing it leads to it allocating more stack (i.e. "growing" the stack) up to a limit. This is usually done in function prologues via the _chkstk function in HLL compiled code, which basically does something like (not exactly but you get the point):
Code:
@@:
  sub esp, 4096
  test eax, [esp]
  dec ecx
  jnz @b    
i.e. it "touches" the stack at every page so that it doesn't miss the guard page. If it touches the guard page, the OS silently eats the page fault and allocates more stack space, otherwise the probe does nothing.

I'm looking for a way that does not require probing the stack. I'd like to do it with no function calls either (an implementation with a "chkstk" function would be simple and I'm already aware of), but instead with exception handling. The currently committed "bottom" of the stack is found at [fs:8] for 32-bit.

So I was thinking to have something like:
  • sub from esp required amount (e.g. sub esp, 65536)
  • compare it with [fs:8] and see if it's below (carry flag is set)
  • if it is below, somehow trigger an exception (probably have to move esp back too?), the exception handler will allocate the stack until esp's location (I'll "return" from it in a custom function that probes the stack since I don't want to cause page faults from within an exception handler, it would kill the app)


This has no function calls, but how do you conditionally trigger an exception based on carry flag? Is there an elegant trick that does not hurt the performance of the common case which is when allocation is not needed? (i.e. I need to fall-through the jump in the general case). I can use a conditional jump, but then how do I know where to "return" after the exception? (I'll have only one exception handler to deal with it, not one per function, but one per all such functions!).

The "obvious" one like:
Code:
  cmp esp, [fs:8]
  jae @f
  hlt  ; fault, though I'm not sure if I have to add esp, 65536 first (else exception handler has no stack)
@@:    
the problem with this is that it is not a fall-through in the general case, which hurts performance. Yeah, I can do a jb followed by a jmp but that's a bit more ugly, I'm asking basically, is there some tricky instruction that can cause a fault only when carry flag is set?

Caveats:
  • my application is a DLL -- a plugin, not a library, technically. It gets called back from processing functions (e.g. like VST in Wavosaur, though that's just an example), so I do *not* control the stack whatsoever on entry.
  • it is multi-threaded, and I want an elegant solution, threads are also created by the host of the plugin, but also from within the plugin, so no I cannot just take the stack size from the module, as each thread could have different stack size
Because of the above there is another problem: how do you get the MAXIMUM STACK SIZE of a given THREAD in Windows? I need to do this from the exception handler to see if it's even safe to probe the stack or not.

Hmm I guess this is longer than I expected. I don't expect a solution but maybe I'm missing something Confused I'm just asking out of curiosity and because it would be nicer than just calling a function to do it. (which has more overhead for the common case)


EDIT: Oh and I'd like to incorporate this into my GCC plugin to tweak the prologue too, so I want it as good as possible even if my app by itself doesn't have any measurable improvement. Razz

Also, I've actually measured the following cases from the above stack probing question:
Code:
test eax, [esp]  ; very fast
mov eax, [esp]   ; fastest
bt dword [esp], 0  ; WTF as fast as mov? faster than test?    
Processor Intel Xeon E3-1241 v3 (Haswell) (note that no combination of test, even with constant zero, provided any differences)


Last edited by Furs on 13 Sep 2017, 17:30; edited 2 times in total
Post 13 Sep 2017, 17:18
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 17:29
Statically allocate.
Code:
format ...
stack 1 shl 20, 1 shl 20
;...    
For the stack size: It is defined when the thread starts, so for the above it is 1MB for the main thread, and when you create a new thread you state the stack size in your call to the API.
Post 13 Sep 2017, 17:29
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 17:32
Well that's a solution in other cases but 1) not what I want and 2) not useable in this situation Wink

The host (executable) creates threads and then processes them in parallel, which can also happen to be callbacks to my function (if they're used as plugins in that chain). So I don't control the threads' stack space at all.

Yes I can create threads from within my own plugin too, but I'm talking about the threads created by the host which end up calling my plugin's callback.
Post 13 Sep 2017, 17:32
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 17:33
What problem are you trying to solve? Do you have some performance issue? Or something else?
Post 13 Sep 2017, 17:33
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 17:40
I doubt this would be a performance issue unless the buffer size is set to like 1 sample or something (so then most of the overhead is in calling the function etc).

However, I just want to know if I'm overlooking something. Specifically:

1) is there an instruction that can cause a fault only when carry flag is set? (or something like that, see my cmp above, or an alternative method?)
2) how do you get the maximum stack size of the thread in Windows? (maximum reserved, not allocated, the latter is easy, found at [fs:8] -- I don't know the thread's origins, however)

For my use case it's probably not significant, but when I'll incorporate into my GCC plugin's prologue tweaker, I'd obviously prefer to have the "better" method. It's just in my nature: if you do something, do it right if you can. Razz
Post 13 Sep 2017, 17:40
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 17:47
There is INTO, not the carry flag there, but you could also use the overflow flag I suppose.

But I'm still not sure what problem you are trying to solve. If not for performance then what?
Post 13 Sep 2017, 17:47
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 17:49
I mean, it is for performance, but not necessarily "significant" performance (depending on the DSP buffer size). I didn't mean it's slower, though.

INTO looks interesting, thanks, it doesn't generate an interrupt when OF is not set right? (btw a trick I meant like.. for example using cmovc, but I know that it will always trigger the fault though)

But for my second question I found this SO answer.
Quote:
Whilst there isn't an API to find out stack size directly, contiguous virtual address space must be reserved up to the maximum stack size - it's just that a lot of that space isn't committed yet. You can take advantage of this and make two calls to VirtualQuery.

For the first call, pass it the address of any value on the stack to get the base address and size, in bytes, of the committed stack space. On an x86 machine where the stack grows downwards, subtract the size from the base address and VirtualQuery again: this will give you the size of the space reserved for the stack (assuming you're not precisely on the limit of stack size at the time). Summing the two naturally gives you the total stack size.
But I'm not sure I understand, why can't he just give some stupid code example, ffs. I'll look into VirtualQuery to see, never used it before.

If I'm not mistaken, his "first" VirtualQuery call sounds rather pointless: such information (base address of committed stack) is already at [fs:8] Confused Or I don't get it.


Last edited by Furs on 13 Sep 2017, 17:54; edited 1 time in total
Post 13 Sep 2017, 17:49
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 17:54
I don't know of anything that can do what you want. Windows isn't really designed with that in mind. Usually large buffers are expected to be allocated by other means that don't use the stack. And naturally the meaning of "large" is not given so you just have to guess where the trade-off point is.
Post 13 Sep 2017, 17:54
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 17:57
I'd expect that base addresses for stacks are from the top going down, not the lowest address.
Post 13 Sep 2017, 17:57
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 18:00
"cmovc eax,[0]" will always generate a fault regardless of the flag value. I discovered this a long time ago and it is annoying. I wanted to conditionally read a memory address if is wasn't zero:
Code:
test edx,edx
cmovnz eax,[edx] ;faults when edx == 0    
Post 13 Sep 2017, 18:00
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 18:07
Yeah, I'm aware about cmov, it was just an example (i.e. that I'm not necessarily looking for an instruction made just for this purpose, like INTO is, but can be any such trick).

It looks like the second problem is solved due to VirtualQuery -- I only need to call it once to get the reserved base address if it's according to the documentation (I'll test it in a few minutes and report back). The stack layout is like this:
Code:
-------------   top of stack, found at [fs:4]
/////////////
\\\\\\\\\\\\\
-------------   bottom of committed stack, found at [fs:8]
-------------   guard page
             \
              \
              | reserved stack (it can grow here)
              /
             /
-------------   limit of stack growth (beyond this is stack overflow), THIS is what I want to find    


Now to figure out a hack for first one Razz (or how to get carry into overflow flag, and test if INTO works on Windows in the first place, maybe it generates faults all the time due to privilege stuff)
Post 13 Sep 2017, 18:07
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 18:12
IIRC:

Stack memory base address != data memory base address.

Stack points to the top, data points to the bottom.
Post 13 Sep 2017, 18:12
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 18:29
revolution wrote:
IIRC:

Stack memory base address != data memory base address.

Stack points to the top, data points to the bottom.
I'm afraid I don't understand what you mean? Can you explain it simpler?

I tried VirtualQuery but the results aren't what I want. I tried both BaseAddress and AllocationBase from its output struct -- both are bad.

BaseAddress seems to be the [fs:8] so it's pointless. AllocationBase doesn't make much sense -- it's always 262144 (1 shl 18 ) bytes below the top of the stack even if I change the stack's size with 'stack' directive.

Just to make sure I even tested it in FASM, here this code:
Code:
stack 1 shl 16
blah rd 8

.code
  start:
        mov eax, [fs:8]
        invoke VirtualQuery, eax, blah, 28
        mov eax, [fs:4]
        sub eax, [blah+4]    
gives 262144 in eax, no matter the value in the 'stack' directive, WTF? (note that [fs:4] - [fs:8] gives 8192 bytes only, so it's not that)
Post 13 Sep 2017, 18:29
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 19:47
Furs wrote:
I'm afraid I don't understand what you mean? Can you explain it simpler?
Stack base address is the top of the memory range, and it expands downwards. Data and code base addresses are the bottom of the memory range and expand upwards.
Post 13 Sep 2017, 19:47
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 20:32
Yeah but I don't touch the data/code base addresses at all here. I view the stack as one large "VirtualAlloc" block in this context (going up, hence the "Base address" of the stack is its final limit since it comes down from the "end" of the block).

BTW, it actually works. Apparently 262144 is the lower limit, and seems to be the default of FASM for some reason. I thought it was 1MB, that's why i got confused. If I use larger values, it reports properly, so this problem is solved. Wink

So the piece of code above works to determine the absolute bottom of the stack (well, the absolute bottom is at blah+4 in it, that code is just subtracting the top from it to get its size -- blah+4 of course refers to AllocationBase, don't worry I won't hardcode offsets when coding it for real this was just a quickie Smile I also don't like using invoke in real asm code...).

What the code does is: it gets the "address of the last committed page" of the stack, uses it in VirtualQuery to get the "address of the base of the entire stack".

For the other problem, I'm thinking a "design" of the prologue like this:
Code:
@@:
add esp, 65536  ; needed else Windows kills the process, since the exception handler needs stack space too
hlt             ; raises exception, the handler deals with probing (actually it "returns" to a normal function but that's implementation detail, can't risk another fault within handler tho)
TheFunction:
  ; push saved regs here
  sub esp, 65536
  cmp esp, [fs:8]
  jb @b

  ; ... actual code    
In this respect, there is minimal overhead for when the stack doesn't need probing -- only one (hopefully predicted) branch and a comparison. I don't see how it could be made with less overhead (in this respect I'll need a vectored handler I guess).

The Exception Handler will recognize this pattern, and scan through the prologue to find out how much to subtract esp by, and then probe the stack, and finally return execution after the jb instruction with the proper stack. Of course I'll first check to see if it fits within the limit of the stack (found by VirtualQuery), if it doesn't well, stack can rest in peace (I'll just show a stack overflow message or something out of the handler). But before I implement it, I'd like to know your ideas, is it good or am I missing something obvious? If you got any elegant tips or something Smile

(again: requirement is to be as MINIMALLY invasive on "normal execution not requiring probing" as possible -- if it's slow when it probes the stack is not a problem since it happens ONCE, but this DSP function gets called many times per second on multiple threads, depending on buffer size, with 1 sample it's like 44100 times per sec and I want to keep its CPU usage low since it's realtime and one out of many effects)

I know you love testing/profiling, but unfortunately I cannot test it here. I mean, I did test it in a loop, but there's literally no difference, you know why? The misprediction happens only for the first time or first few times, so in a testing loop with 1<<31 iterations, it won't make a dent.

However in real code, this won't be ran in a loop, but of course on every call to the function. It may not be a big deal, but if it's a "free change" then why not? I don't see a way to find out if a branch was predicted correctly or not... maybe with rdtscp since it blocks speculative execution?

(and I don't have the actual function's code yet to test, just open to speculation so far)

BTW, I realize I can allocate much more of the stack if I have the information from VirtualQuery and not bother with the "cmp/jb" at all and have all the stack allocated always. But that allocates stack space for each thread, seems a bit wasteful, especially if the host increases reserved stack size "just for the heck of it". Two instructions shouldn't really matter, assuming the branch gets predicted to fall-through (not taken). I hope.

NOTE: This alternative is also dangerous if I use too high a stack, as the stack pointer could go below the stack entirely, and into something else (Heap?) and not crash at all but corrupt random data (because it points to valid pages). That's bad. So I'll still go with the cmp/jb and hlt idea I guess, much safer, since I don't control the incoming stack or its size. I'd like it to be as "bulletproof" as possible in any stupid host that can use my plugin.
Post 13 Sep 2017, 20:32
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 20:42
I suppose an alternative could be to simply use the stack without testing anything, and in your exception handler do the probing.
Code:
TheFunction:
  ; push saved regs here
  sub esp, 65536
  ;start using the stack now
  ;if there is a memory violation then use SEH to recover and do the probing.    
Make sure to use the full SEH structure to set the safe stack values.

You could also replace your HLT with a call to the probing function and a jmp to reenter the function where it left off.
Code:
@@:
add esp, 65536
call prober_thingy
jmp @f
TheFunction:
  ; push saved regs here
  sub esp, 65536
  cmp esp, [fs:8]
  jb @b
@@:
  ; ... actual code    
Post 13 Sep 2017, 20:42
View user's profile Send private message Visit poster's website Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 2493
Furs 13 Sep 2017, 20:52
I tried it the first time but it doesn't work for two reasons:

1) Windows *wants* the stack (at esp) to be available when exception occurs; SEH uses *this* stack to call your code, not a different one, which sucks. What happens is Windows will just kill the process (Wine seems to work fine but Wine apparently allocates all stack on default thread?)

2) It's dangerous if I use something larger than 65536 -- imagine I subtract esp below the entire stack, into valid memory (Heap?), then there won't be any exception, only silent corruption of random data (this was a huge thing in earlier Linux versions, not sure if Windows suffers from it, but yeah, it sounds bad).


To solve (1) I could instead probe just the end of the stack I want before subtracting, i.e.:
Code:
test eax, [esp-65536]
sub esp, 65536    
this will fault properly, but then again, it's only 1 instruction less now and a bit less safe due to (2) Razz (assuming jump gets predicted)


The call sounds good, not sure why I hadn't thought of it, perhaps I still had in my mind to "not touch the stack"... Wink thanks
Post 13 Sep 2017, 20:52
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20304
Location: In your JS exploiting you and your system
revolution 13 Sep 2017, 21:08
Hmm, I see:
TFM wrote:
The exception handler specified by lpTopLevelExceptionFilter is executed in the context of the thread that caused the fault. This can affect the exception handler's ability to recover from certain exceptions, such as an invalid stack.
Annoying.
Post 13 Sep 2017, 21:08
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.