flat assembler
Message board for the users of flat assembler.
Windows > Handling potential stack overflow without probing
revolution 13 Sep 2017, 17:29
Statically allocate.
Code:
    format ...
    stack 1 shl 20, 1 shl 20
    ;...
Furs 13 Sep 2017, 17:32
Well, that's a solution in other cases, but 1) it's not what I want and 2) it's not usable in this situation.
The host (executable) creates threads and runs them in parallel, and those threads can end up calling back into my function (if my plugin is used in that chain). So I don't control the threads' stack space at all. Yes, I can create threads from within my own plugin too, but I'm talking about the threads created by the host which end up calling my plugin's callback.
revolution 13 Sep 2017, 17:33
What problem are you trying to solve? Do you have some performance issue? Or something else?
Furs 13 Sep 2017, 17:40
I doubt this would be a performance issue unless the buffer size is set to something like 1 sample (so that most of the overhead is in calling the function etc).
However, I just want to know if I'm overlooking something. Specifically:
1) Is there an instruction that can cause a fault only when the carry flag is set? (or something like that, see my cmp above, or an alternative method?)
2) How do you get the maximum stack size of the thread in Windows? (maximum reserved, not allocated; the latter is easy, found at [fs:8] -- I don't know the thread's origins, however)
For my use case it's probably not significant, but when I incorporate this into my GCC plugin's prologue tweaker, I'd obviously prefer to have the "better" method. It's just in my nature: if you do something, do it right if you can.
revolution 13 Sep 2017, 17:47
There is INTO. It tests the overflow flag, not the carry flag, but you could use the overflow flag instead, I suppose.
But I'm still not sure what problem you are trying to solve. If not for performance, then what?
Furs 13 Sep 2017, 17:49
I mean, it is for performance, but not necessarily "significant" performance (depending on the DSP buffer size). I didn't mean it's slower, though.
INTO looks interesting, thanks. It doesn't generate an interrupt when OF is not set, right? (BTW, by a trick I meant something like using cmovc, for example, though I know that it will always trigger the fault.)
But for my second question I found this SO answer.
Quote: Whilst there isn't an API to find out stack size directly, contiguous virtual address space must be reserved up to the maximum stack size - it's just that a lot of that space isn't committed yet. You can take advantage of this and make two calls to VirtualQuery.
If I'm not mistaken, his "first" VirtualQuery call sounds rather pointless: that information (base address of committed stack) is already at [fs:8]. Or I don't get it.
Last edited by Furs on 13 Sep 2017, 17:54; edited 1 time in total
revolution 13 Sep 2017, 17:54
I don't know of anything that can do what you want. Windows isn't really designed with that in mind. Usually large buffers are expected to be allocated by other means that don't use the stack. And naturally the meaning of "large" is not given so you just have to guess where the trade-off point is.
revolution 13 Sep 2017, 17:57
I'd expect that base addresses for stacks are from the top going down, not the lowest address.
revolution 13 Sep 2017, 18:00
"cmovc eax,[0]" will always generate a fault regardless of the flag value. I discovered this a long time ago and it is annoying. I wanted to conditionally read a memory address if it wasn't zero:
Code:
    test edx,edx
    cmovnz eax,[edx] ;faults when edx == 0
Furs 13 Sep 2017, 18:07
Yeah, I'm aware of that about cmov, it was just an example (i.e. I'm not necessarily looking for an instruction made just for this purpose, like INTO is; any such trick will do).
It looks like the second problem is solved thanks to VirtualQuery -- I only need to call it once to get the reserved base address, if it behaves according to the documentation (I'll test it in a few minutes and report back). The stack layout is like this:
Code:
    ------------- top of stack, found at [fs:4]
    /////////////
    \\\\\\\\\\\\\
    ------------- bottom of committed stack, found at [fs:8]
    ------------- guard page
    \
    \
    |   reserved stack (it can grow here)
    /
    /
    ------------- limit of stack growth (beyond this is stack overflow), THIS is what I want to find
Now to figure out a hack for the first one (or how to get carry into the overflow flag, and test whether INTO works on Windows in the first place; maybe it generates faults all the time due to privilege stuff).
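A minimal sketch of the "carry into overflow flag" idea, for illustration only. It assumes 32-bit code (INTO is not valid in 64-bit mode) and that AL is free to clobber. How Windows reports the resulting interrupt 4 is an assumption here (expected to be an integer-overflow exception) and is not verified in this thread.

```asm
        cmp     esp, [fs:8]     ; CF = 1 when esp is below the committed stack limit
        setc    al              ; al = 1 if CF was set, otherwise 0 (clobbers al)
        add     al, 127         ; 1 + 127 = 128 overflows a signed byte, so OF now mirrors the old CF
        into                    ; raises interrupt 4 only when OF = 1
```

This takes more instructions than a plain cmp/jb pair, so it mainly shows that the flag conversion is possible, not that it is cheaper.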
revolution 13 Sep 2017, 18:12
IIRC:
Stack memory base address != data memory base address. Stack points to the top, data points to the bottom.
Furs 13 Sep 2017, 18:29
revolution wrote: IIRC:
I tried VirtualQuery but the results aren't what I want. I tried both BaseAddress and AllocationBase from its output struct -- both are bad. BaseAddress seems to be the same as [fs:8], so it's pointless. AllocationBase doesn't make much sense -- it's always 262144 (1 shl 18) bytes below the top of the stack, even if I change the stack's size with the 'stack' directive. Just to make sure, I even tested it in FASM with this code:
Code:
    stack 1 shl 16
    blah rd 8                           ; receives MEMORY_BASIC_INFORMATION (28 bytes)
    .code
    start:
    mov eax, [fs:8]                     ; bottom of committed stack
    invoke VirtualQuery, eax, blah, 28
    mov eax, [fs:4]                     ; top of stack
    sub eax, [blah+4]                   ; top minus AllocationBase
revolution 13 Sep 2017, 19:47
Furs wrote:
I'm afraid I don't understand what you mean. Can you explain it more simply?
Furs 13 Sep 2017, 20:32
Yeah, but I don't touch the data/code base addresses at all here. I view the stack as one large "VirtualAlloc" block in this context (going up, hence the "base address" of the stack is its final limit, since the stack comes down from the "end" of the block).
BTW, it actually works. Apparently 262144 is the lower limit, and seems to be the default of FASM for some reason. I thought it was 1MB, that's why I got confused. If I use larger values, it reports properly, so this problem is solved.
So the piece of code above works to determine the absolute bottom of the stack (well, the absolute bottom is at blah+4 in it; that code just subtracts it from the top to get the stack's size. blah+4 of course refers to AllocationBase. Don't worry, I won't hardcode offsets when coding it for real, this was just a quickie. I also don't like using invoke in real asm code...). What the code does is: it gets the "address of the last committed page" of the stack and passes it to VirtualQuery to get the "address of the base of the entire stack".
For the other problem, I'm thinking of a "design" of the prologue like this:
Code:
    @@:
    add esp, 65536    ; needed, else Windows kills the process, since the exception handler needs stack space too
    hlt               ; raises exception, the handler deals with probing (actually it "returns" to a normal function but that's an implementation detail, can't risk another fault within the handler tho)
    TheFunction:
    ; push saved regs here
    sub esp, 65536
    cmp esp, [fs:8]
    jb @b
    ; ... actual code
The exception handler will recognize this pattern, scan through the prologue to find out how much to subtract from esp, then probe the stack, and finally resume execution after the jb instruction with the proper stack. Of course I'll first check whether the allocation fits within the limit of the stack (found by VirtualQuery); if it doesn't, well, the stack can rest in peace (I'll just show a stack overflow message or something from the handler). But before I implement it, I'd like to know your ideas: is it good, or am I missing something obvious?
Any elegant tips are welcome (again: the requirement is to be as MINIMALLY invasive on "normal execution not requiring probing" as possible -- if it's slow when it probes the stack, that's not a problem since it happens ONCE, but this DSP function gets called many times per second on multiple threads, depending on buffer size; with 1 sample it's like 44100 times per sec, and I want to keep its CPU usage low since it's realtime and one out of many effects).
I know you love testing/profiling, but unfortunately I cannot test it here. I mean, I did test it in a loop, but there's literally no difference. You know why? The misprediction happens only the first time or the first few times, so in a testing loop with 1<<31 iterations it won't make a dent. However, in real code this won't be run in a loop, but of course on every call to the function. It may not be a big deal, but if it's a "free change" then why not? I don't see a way to find out if a branch was predicted correctly or not... maybe with rdtscp, since it blocks speculative execution? (And I don't have the actual function's code yet to test, just open to speculation so far.)
BTW, I realize I could allocate much more of the stack up front if I have the information from VirtualQuery, and not bother with the "cmp/jb" at all, so that the whole stack is always allocated. But that allocates stack space for each thread, which seems a bit wasteful, especially if the host increases the reserved stack size "just for the heck of it". Two instructions shouldn't really matter, assuming the branch gets predicted to fall through (not taken). I hope.
NOTE: This alternative is also dangerous if I use too much stack, as the stack pointer could go below the stack entirely and into something else (heap?), and not crash at all but corrupt random data (because it points to valid pages). That's bad. So I'll still go with the cmp/jb and hlt idea I guess, much safer, since I don't control the incoming stack or its size.
I'd like it to be as "bulletproof" as possible in any stupid host that can use my plugin.
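A fuller sketch of the VirtualQuery approach described above, written without hardcoded offsets. It assumes a 32-bit Windows target and fasm's standard win32a.inc header. The structure name MEMINFO is a hypothetical local name for the 28-byte MEMORY_BASIC_INFORMATION layout, and returning the result as the process exit code is only for demonstration.

```asm
format PE console
entry start

include 'win32a.inc'

; Hypothetical local name for the 32-bit MEMORY_BASIC_INFORMATION layout (28 bytes).
struct MEMINFO
  BaseAddress       dd ?
  AllocationBase    dd ?
  AllocationProtect dd ?
  RegionSize        dd ?
  State             dd ?
  Protect           dd ?
  Type              dd ?
ends

section '.data' data readable writeable

  mbi MEMINFO

section '.text' code readable executable

start:
        mov     eax, [fs:8]                 ; lowest committed stack address
        invoke  VirtualQuery, eax, mbi, sizeof.MEMINFO
        test    eax, eax                    ; VirtualQuery returns 0 on failure
        jz      .done
        mov     eax, [fs:4]                 ; top of stack
        sub     eax, [mbi.AllocationBase]   ; eax = total reserved stack size in bytes
.done:
        invoke  ExitProcess, eax            ; demo only: exit code = reserved size (0 on failure)

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL'

  import kernel32,\
         VirtualQuery,'VirtualQuery',\
         ExitProcess,'ExitProcess'
```

Since every page of the stack belongs to the same reservation, any address inside the stack should yield the same AllocationBase; [fs:8] is used here only because the thread already uses it.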
revolution 13 Sep 2017, 20:42
I suppose an alternative could be to simply use the stack without testing anything, and in your exception handler do the probing.
Code:
    TheFunction:
    ; push saved regs here
    sub esp, 65536
    ;start using the stack now
    ;if there is a memory violation then use SEH to recover and do the probing.
You could also replace your HLT with a call to the probing function and a jmp to reenter the function where it left off.
Code:
    @@:
    add esp, 65536
    call prober_thingy
    jmp @f            ; note: esp is back at its pre-allocation value here unless prober_thingy lowers it
    TheFunction:
    ; push saved regs here
    sub esp, 65536
    cmp esp, [fs:8]
    jb @b
    @@:
    ; ... actual code
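A minimal sketch of what a probing routine such as prober_thingy might look like, assuming 32-bit code, 4 KiB pages, and the usual guard-page mechanism (touching the page just below the committed area makes Windows commit it and move the guard down). The name probe_stack and its register convention are hypothetical, not something given in this thread.

```asm
; Hypothetical stand-in for prober_thingy.
; in:  eax = number of bytes needed below the caller's stack pointer
; out: eax and flags clobbered, all other registers preserved
probe_stack:
        push    ecx
        lea     ecx, [esp+8]        ; caller's esp (skip the saved ecx and the return address)
        neg     eax
        add     eax, ecx            ; eax = lowest address that must become usable
.next_page:
        sub     ecx, 4096           ; step down one page
        cmp     ecx, eax
        jbe     .last               ; reached or passed the target address
        test    dword [ecx], eax    ; a read is enough to trip the guard page
        jmp     .next_page
.last:
        test    dword [eax], eax    ; touch the final, lowest address
        pop     ecx
        ret
```

The routine walks down one page at a time because skipping past the guard page lands in reserved memory that is not committed. It does not check the request against the reservation limit, so a request larger than the remaining stack would still fault.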
Furs 13 Sep 2017, 20:52
I tried it the first time but it doesn't work, for two reasons:
1) Windows *wants* the stack (at esp) to be available when an exception occurs; SEH uses *this* stack to call your code, not a different one, which sucks. What happens is Windows will just kill the process (Wine seems to work fine, but Wine apparently allocates all stack on the default thread?)
2) It's dangerous if I use something larger than 65536 -- imagine I subtract esp below the entire stack, into valid memory (heap?); then there won't be any exception, only silent corruption of random data (this was a huge thing in earlier Linux versions, not sure if Windows suffers from it, but yeah, it sounds bad).
To solve (1) I could instead probe just the end of the stack I want before subtracting, i.e.:
Code:
    test eax, [esp-65536]
    sub esp, 65536
The call sounds good, not sure why I hadn't thought of it, perhaps I still had it in my mind to "not touch the stack"... thanks
revolution 13 Sep 2017, 21:08
Hmm, I see:
TFM wrote:
The exception handler specified by lpTopLevelExceptionFilter is executed in the context of the thread that caused the fault. This can affect the exception handler's ability to recover from certain exceptions, such as an invalid stack.
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.