flat assembler
Message board for the users of flat assembler.
Windows > wndproc
f0dder 26 Jun 2007, 22:33
optimizing a wndproc is a waste of time
What do you want to "optimize" it for anyway? Speed or size? (Optimizing a wndproc for speed is a 100% waste of time.)
LocoDelAssembly 26 Jun 2007, 23:29
Well, the "sub esp, 20" has no business being there. Look at what really happens:
Code:
WindowProc:
        push    ebp
        mov     ebp, esp
        .
        .
        .
        sub     esp, 20
        leave                   ; mov esp, ebp | pop ebp
        jmp     [DefWindowProc]
As you can see, the work done by "sub esp, 20" is immediately destroyed by the "leave" instruction.
Certain instructions also tend to encode in fewer bytes when they use EAX than when they use another register.
And yes, optimizing this is probably a waste of time, because the benefit is very unlikely to be noticeable. If you are interested in optimizing at all, optimizing for size is probably the best thing to do.
hologram 29 Jun 2007, 06:36
So why does it work?
f0dder 29 Jun 2007, 12:20
hologram wrote: So why does it work?
Look up how the "leave" instruction works.
LocoDelAssembly 29 Jun 2007, 13:44
Quote:
Code:
        sub     esp, 20
        leave                   ; mov esp, ebp | pop ebp
Pascal-like equivalent:
Code:
esp := esp - 20;                { sub esp, 20 }
{*** leave ***}
esp := ebp;                     { mov esp, ebp }
ebp := esp^; esp := esp + 4;    { pop ebp }
{*** end of leave instruction ***}
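The same point can be checked with a toy model in C. This is only a sketch: the `Cpu` struct, the helper names and the starting register values are made up for illustration, and the "stack" is just a small array. It runs the prologue, an optional "sub esp, extra", and then "leave", and returns the final esp.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model: a small stack plus the two registers involved. */
typedef struct {
    uint32_t mem[64];   /* stack memory, one slot per 32-bit word */
    uint32_t esp;       /* stack pointer, as a byte offset into mem */
    uint32_t ebp;       /* frame pointer, as a byte offset into mem */
} Cpu;

static void push(Cpu *c, uint32_t v) { c->esp -= 4; c->mem[c->esp / 4] = v; }
static uint32_t pop(Cpu *c) { uint32_t v = c->mem[c->esp / 4]; c->esp += 4; return v; }

/* leave behaves like: mov esp, ebp ; pop ebp */
static void leave(Cpu *c) { c->esp = c->ebp; c->ebp = pop(c); }

/* Prologue, "sub esp, extra", then leave. Returns the final esp. */
uint32_t esp_after_frame(uint32_t extra) {
    Cpu c = { .esp = 128, .ebp = 0xAAAA };  /* arbitrary starting values */
    uint32_t caller_ebp = c.ebp;

    push(&c, c.ebp);                        /* push ebp      */
    c.ebp = c.esp;                          /* mov ebp, esp  */
    assert(extra % 4 == 0 && extra <= c.esp);
    c.esp -= extra;                         /* sub esp, extra */
    leave(&c);                              /* discards whatever sub did */

    assert(c.ebp == caller_ebp);            /* caller's ebp is restored */
    return c.esp;
}
```

In this model `esp_after_frame(0)` and `esp_after_frame(20)` return the same value, because "leave" reloads esp from ebp before popping.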
r22 30 Jun 2007, 06:32
If you had A LOT of window messages to check, I could see using a jump table instead of a repeated chain of cmp and jz. A jump table would increase the size of your code, BUT (if you had A LOT of messages to check) it "could" noticeably improve performance.
Example of a jump table:
Code:
JMP_TABLE:
        dd      jmpdefault      ;; if wmsg = 0
        dd      jmpdefault      ;; if wmsg = 1
        dd      wm_destroy      ;; if wmsg = 2 = WM_DESTROY
EJMP_TABLE:
        ...
        mov     edx, [wmsg]
        ;; make sure the value is below the number of entries
        ;; in your jump table (3 in this example)
        cmp     edx, (EJMP_TABLE - JMP_TABLE) shr 2
        jae     jmpdefault      ;; a conditional jump cannot take a memory operand, so go through the stub
        jmp     dword [JMP_TABLE + edx*4]
        ...
jmpdefault:
        jmp     [DefWindowProc]
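For readers who think more easily in C, the table above corresponds roughly to an array of function pointers with a bounds check in front of it. This is a minimal sketch, not Win32 code: `on_default` and `on_destroy` are hypothetical stand-ins for DefWindowProc and the wm_destroy label, and their return values are arbitrary markers.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*Handler)(unsigned msg);

/* Hypothetical handlers; the return values only identify which one ran. */
static int on_default(unsigned msg) { (void)msg; return 0; }  /* stands in for DefWindowProc */
static int on_destroy(unsigned msg) { (void)msg; return 1; }  /* stands in for wm_destroy    */

static const Handler jump_table[] = {
    on_default,   /* msg = 0 */
    on_default,   /* msg = 1 */
    on_destroy,   /* msg = 2 = WM_DESTROY */
};
#define TABLE_ENTRIES (sizeof jump_table / sizeof jump_table[0])

int dispatch(unsigned msg) {
    if (msg >= TABLE_ENTRIES)          /* cmp edx, entries / jae jmpdefault */
        return on_default(msg);
    assert(jump_table[msg] != NULL);   /* every slot must be filled */
    return jump_table[msg](msg);       /* jmp dword [JMP_TABLE + edx*4] */
}
```

The unsigned comparison plays the same role as "jae": any message number at or past the end of the table goes to the default handler instead of indexing out of bounds.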
m 30 Jun 2007, 10:59
How many times better is a jump table than a straight cmp-jmp chain?
f0dder 30 Jun 2007, 22:28
A jump table decreases code size but increases data size, and easily quite dramatically. You could end up with some heavy cache thrashing, so the CMP+JE sequence (please, JE and not JZ; it's the more logical mnemonic) is probably smarter.
You could even handle the few messages that arrive most of the time with a CMP+JE sequence, and then switch to the binary tree approach for the rest - but again, this is fucking overkill for something as speed-insensitive as a wndproc.
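A rough C sketch of that hybrid idea follows. It is only an illustration under assumptions: which messages count as "hot" (WM_MOUSEMOVE and WM_PAINT here) is a guess, the `ACT_*` codes are hypothetical, and a binary search over a sorted array is used to model the compare tree you would write by hand in assembly. The message numbers themselves are the standard Win32 values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action codes returned instead of running real handlers. */
enum { ACT_DEFAULT = 0, ACT_PAINT, ACT_MOUSEMOVE, ACT_CREATE,
       ACT_DESTROY, ACT_SIZE, ACT_CLOSE };

typedef struct { unsigned msg; int action; } Entry;

/* Rarely seen messages, kept sorted by message number. */
static const Entry rare[] = {
    { 0x0001, ACT_CREATE  },   /* WM_CREATE  */
    { 0x0002, ACT_DESTROY },   /* WM_DESTROY */
    { 0x0005, ACT_SIZE    },   /* WM_SIZE    */
    { 0x0010, ACT_CLOSE   },   /* WM_CLOSE   */
};

int classify(unsigned msg) {
    /* Hot path: a short compare-and-branch chain (CMP+JE). */
    if (msg == 0x0200) return ACT_MOUSEMOVE;   /* WM_MOUSEMOVE */
    if (msg == 0x000F) return ACT_PAINT;       /* WM_PAINT     */

    /* Cold path: binary search, about log2(n) compares per lookup. */
    size_t lo = 0, hi = sizeof rare / sizeof rare[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rare[mid].msg == msg) return rare[mid].action;
        if (rare[mid].msg < msg) lo = mid + 1;
        else hi = mid;
    }
    return ACT_DEFAULT;                        /* fall through to the default handler */
}
```

Frequent messages cost one or two compares, and everything else costs a handful more, without any table indexed by raw message number.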
Enko 30 Jun 2007, 23:03
Quote:
Aren't they synonyms? Like "dark" and "black", almost the same.
LocoDelAssembly 01 Jul 2007, 00:09
Yep, both mnemonics correspond to the same instruction. f0dder means that it looks illogical to compare values and then jump if the comparison result is "zero" instead of "equal". It regains some sense when you remember that CMP is a SUB that doesn't save the result in the destination, though.
About the jump table: it is not so good when there are too few consecutive values; you have to mix it with the binary tree approach. I'm not so sure about what f0dder says about cache thrashing. After all, code wastes cache too, and with micro-architecture jokes like the one in the Pentium 4, branching pollutes the trace cache much more, because it tends to store the same decoded instruction more than once.
r22 01 Jul 2007, 05:03
WM_'s go from 0 - 1023 (not including USER DEFINED).
Each entry in the jump table would take up 4 bytes on a 32-bit system: 1024*4 = 4096 bytes, or 4 KB. A good size for a LUT.
I would probably only implement a jump table for the wndproc if I had ~30 or more messages that I wanted to handle. Any fewer than that and it would just be overkill, IMHO.
But it's an interesting subject. Most of us just assume, from experience and from how the windowing system works, that optimizing the wndproc would be pointless, BUT I don't think we've ever actually set up a benchmark to know once and for all. Maybe in an OpenGL program you'd want an optimal window proc for key handling...
Odd that the fasm ddraw example works but the opengl example doesn't work on my Win XP64 box.
f0dder 01 Jul 2007, 09:44
LocoDelAssembly: IMHO "JE" is more logical than "JZ" in this particular case, since you're (logically) checking for equality, not zero.
Quote:
Well, you waste less cache with the usually rather small number of messages you have to handle. The CMP/JE sequence takes at most 11 bytes per message (7 bytes for those whose jump fits within the +127 byte range), meaning that you'd need >370 messages to take up the same space as a jump table.
Hadn't thought of the trace cache; that could be an issue as well (but how much, really? How much effect does it have, considering the amount of API code etc. your wndproc goes through?)
r22 wrote:
It's going to be pretty damn hard to do any timings, even if you find a really old box. I'm not sure how I'd even go about setting up the timing...
r22 wrote:
Don't really think so - even on a 100 MHz 486, key input is infinitely slow... slow in the sense that you're dealing with a pathetic human being with all our mechanical limitations.
The only thing I can really think of where wndproc handling could matter (i.e., with enough messages coming in at a fast rate) would be WM_* async socket stuff... but that sucks big time, and its bottleneck lies in the whole message system rather than in how you code your wndproc anyway.
Not saying that jump tables can't be good, though - they can be really nifty when dealing with byte-range input, for instance.
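The size comparison in this post is simple arithmetic, and a small C sketch makes it easy to replay with other numbers. The function names are made up for illustration; the inputs (1024 entries of 4 bytes, 11 or 7 bytes per CMP/JE pair) are the figures quoted in the thread. It only counts dispatch bytes and ignores the handlers themselves.

```c
#include <assert.h>

/* Size of a full jump table: one pointer per possible message number. */
unsigned table_bytes(unsigned entries, unsigned bytes_per_entry) {
    return entries * bytes_per_entry;
}

/* Smallest number of handled messages whose CMP/JE chain is
   strictly larger than a table of table_size bytes. */
unsigned break_even_messages(unsigned table_size, unsigned bytes_per_compare) {
    assert(bytes_per_compare > 0);
    return table_size / bytes_per_compare + 1;
}
```

With the thread's figures, `table_bytes(1024, 4)` is 4096, `break_even_messages(4096, 11)` is 373 and `break_even_messages(4096, 7)` is 586, which agrees with the ">370 messages" estimate for the worst case.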
LocoDelAssembly 01 Jul 2007, 15:26
Quote:
And I agree. What I said about JZ was that it is not an intentional obfuscation, because CMP is a subtraction, and when both values are equal the result is zero.
About the trace cache: for some reason I can't find the long and good explanation in Agner Fog's manual. Still, it says this:
Agner Fog - Optimizing Assembly wrote: Microprocessors with a trace cache are likely to store multiple instances of the same
About the 4 KB table: if it is used, then f0dder is right about cache thrashing (unless very few cache line boundaries get accessed).
f0dder 01 Jul 2007, 16:30
LocoDelAssembly wrote:
You'd have to look at the distribution of WM_* messages handled by your program to answer that one.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.