wndproc (flat assembler message board, Windows forum)

hologram 26 Jun 2007, 20:30
How can I optimise the wndproc further?
This is my example, the simplest one I could write. Does anybody use anything else?

Code:
proc    WindowProc hwnd,wmsg,wparam,lparam

        mov     edx,[wmsg]
        cmp     edx,WM_KEYDOWN
        jz      wmkeydown
        cmp     edx,WM_DESTROY
        jz      wmdestroy
        sub     esp,20
        leave
        jmp     [DefWindowProc]
wmkeydown:
        mov     edx,[wparam]
        cmp     edx,VK_ESCAPE
        jnz     return_0
wmdestroy:
        invoke  PostQuitMessage,0
return_0:
        xor     eax,eax
        ret
endp
    
f0dder 26 Jun 2007, 22:33
Optimizing a wndproc is a waste of time :)

What do you want to "optimize" it for anyway? Speed or size? (Optimizing for speed is a 100% waste of time for a wndproc.)

LocoDelAssembly 26 Jun 2007, 23:29
Well, the "sub esp,20" has no business being there. Look at what really happens:

Code:
WindowProc:
  push ebp
  mov  ebp, esp
  .
  .
  .
  sub  esp, 20
  leave ; mov esp, ebp | pop ebp
  jmp [DefWindowProc]    

As you can see, the work done by "sub esp, 20" is immediately destroyed by the "leave" instruction.

Also, certain instructions encode in fewer bytes when they use EAX than when they use another register (for example, "cmp eax,imm32" is one byte shorter than "cmp edx,imm32").

And yes, optimizing this is probably a waste of time, because the benefits will most likely be too small to notice. If you do want to optimize, optimizing for size is probably the best choice.
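A minimal sketch of the EAX remark, written in Python so it is easy to check: it hand-encodes `cmp reg, imm32` and shows the short accumulator form. The helper name `encode_cmp_imm32` is made up for illustration, and only the 32-bit immediate forms are modelled (an assembler would pick the shorter sign-extended imm8 form when the constant fits in a byte).

```python
# Register numbers as used in the ModRM byte of 32-bit x86 instructions.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
       "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_cmp_imm32(reg, imm):
    """Hypothetical helper: bytes of 'cmp reg, imm32' (imm32 forms only)."""
    imm_bytes = (imm & 0xFFFFFFFF).to_bytes(4, "little")
    if reg == "eax":
        return bytes([0x3D]) + imm_bytes       # accumulator short form: 3D id
    modrm = 0xC0 | (7 << 3) | REG[reg]         # mod=11, /7 selects CMP
    return bytes([0x81, modrm]) + imm_bytes    # general form: 81 /7 id

WM_KEYDOWN = 0x0100  # does not fit in a signed byte, so imm32 is needed

if __name__ == "__main__":
    for reg in ("eax", "edx"):
        code = encode_cmp_imm32(reg, WM_KEYDOWN)
        print(reg, code.hex(), len(code), "bytes")
```

For WM_KEYDOWN the EAX form comes out one byte shorter than the EDX form used in the original wndproc.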

hologram 29 Jun 2007, 06:36
So why does it work?

f0dder 29 Jun 2007, 12:20
hologram wrote:
So why does it work?


Look up how the "leave" instruction works :)

LocoDelAssembly 29 Jun 2007, 13:44
Quote:

So why does it work?

Code:
  sub  esp, 20 
  leave ; mov esp, ebp | pop ebp     

Pascal-like equivalent
Code:
esp := esp - 20; { sub esp, 20 }
{*** leave ***}
esp := ebp; { mov esp, ebp }
ebp := esp^; esp := esp + 4; { pop ebp }
{*** end of leave instruction *** }    
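The same steps can be replayed in a small Python model, offered only as a sketch: it tracks ESP, EBP and the one stack slot the prologue writes, and runs the sequence with and without the stray `sub esp, 20`. The starting register values and the function name `run` are arbitrary assumptions.

```python
# Toy model of the two registers and the stack slot involved; not real x86 emulation.
def run(sub_amount):
    """Run prologue, optional 'sub esp, n', then 'leave'; return (esp, ebp)."""
    stack = {}                  # address -> 32-bit value
    esp, ebp = 0x1000, 0xCAFE   # arbitrary starting values (assumed)

    # push ebp
    esp -= 4
    stack[esp] = ebp
    # mov ebp, esp
    ebp = esp
    # sub esp, n  (n = 0 models leaving the instruction out)
    esp -= sub_amount
    # leave = mov esp, ebp ; pop ebp
    esp = ebp
    ebp = stack[esp]
    esp += 4
    return esp, ebp

if __name__ == "__main__":
    print(run(20))  # with the stray "sub esp, 20"
    print(run(0))   # without it
```

Both calls end with the same ESP and the caller's EBP restored, which is why the original code works despite the extra instruction.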

r22 30 Jun 2007, 06:32
If you had "A LOT" of window messages to check, I could see using a jump table as opposed to repeated cmp and jz instructions. A jump table would increase the size of your code, BUT (if you had A LOT of messages to check) it "could" noticeably improve performance.

Example of a jump table:
Code:
JMP_TABLE:
dd jmpdefault ;;if wmsg = 0
dd jmpdefault ;;if wmsg = 1
dd wm_destroy ;;if wmsg = 2 = WM_DESTROY
EJMP_TABLE:
...
        mov edx,[wmsg]
;;make sure the value is below the number of entries
;;in your jump table (3 in this example)
        cmp edx, (EJMP_TABLE - JMP_TABLE) SHR 2
        jae jmpdefault ;;a conditional jump cannot take a memory operand, so go through the stub
        jmp dword[JMP_TABLE + edx*4]
...
jmpdefault:
        jmp [DefWindowProc]
    
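For readers who want to play with the dispatch logic outside an assembler, here is a rough Python analogue of the table above. It is a sketch only: the handler names are hypothetical stand-ins, and a list of functions plays the role of the `dd` entries.

```python
WM_DESTROY = 0x0002   # real Win32 value

def def_window_proc(msg):   # stand-in for DefWindowProc
    return "default"

def on_destroy(msg):        # hypothetical handler
    return "destroy"

# One slot per message number, like the dd entries in JMP_TABLE.
JMP_TABLE = [def_window_proc, def_window_proc, on_destroy]

def window_proc(msg):
    # The unsigned "cmp / jae" check: anything outside the table goes to the default.
    if not 0 <= msg < len(JMP_TABLE):
        return def_window_proc(msg)
    # Equivalent of: jmp dword [JMP_TABLE + edx*4]
    return JMP_TABLE[msg](msg)

if __name__ == "__main__":
    for msg in (0, WM_DESTROY, 0x0100):
        print(hex(msg), window_proc(msg))
```

The range check has to come first; indexing the table with an unchecked message number is the one mistake this pattern does not forgive.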
m 30 Jun 2007, 10:59
How many times better is a jump table than a straight cmp-jmp chain?

f0dder 30 Jun 2007, 22:28
A jump table decreases code size but increases data size, and easily by a dramatic amount. You could end up with some heavy cache thrashing, so the CMP+JE sequence (please, JE and not JZ; it is the more logical mnemonic) is probably smarter.

You could even handle the few messages that arrive most of the time with a CMP+JE sequence, and then switch to a binary-tree approach for the rest - but again, this is fucking overkill for something as speed-insensitive as a wndproc :)
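A sketch of that hybrid idea in Python, under stated assumptions: which messages count as "hot" is a guess made for illustration, the handler names are placeholders, and a binary search over a sorted list stands in for the tree of compares.

```python
from bisect import bisect_left

# Real Win32 message values.
WM_DESTROY, WM_SIZE, WM_PAINT, WM_CLOSE = 0x0002, 0x0005, 0x000F, 0x0010
WM_KEYDOWN, WM_MOUSEMOVE = 0x0100, 0x0200

# Assumed to be the frequent messages: tested first with a short linear chain.
HOT = [(WM_MOUSEMOVE, "mousemove"), (WM_PAINT, "paint")]

# Everything else, sorted by value so it can be binary searched.
COLD = sorted([(WM_DESTROY, "destroy"), (WM_SIZE, "size"),
               (WM_CLOSE, "close"), (WM_KEYDOWN, "keydown")])
COLD_KEYS = [value for value, _ in COLD]

def dispatch(msg):
    for value, name in HOT:           # the CMP+JE chain
        if msg == value:
            return name
    i = bisect_left(COLD_KEYS, msg)   # the binary-tree part
    if i < len(COLD_KEYS) and COLD_KEYS[i] == msg:
        return COLD[i][1]
    return "default"                  # would fall through to DefWindowProc

if __name__ == "__main__":
    for msg in (WM_MOUSEMOVE, WM_KEYDOWN, 0x0001):
        print(hex(msg), dispatch(msg))
```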

Enko 30 Jun 2007, 23:03
Quote:

JE and not JZ

Aren't they synonyms? Like "dark" and "black", almost the same.

LocoDelAssembly 01 Jul 2007, 00:09
Yep, both mnemonics correspond to the same instruction. f0dder means that it looks illogical to compare values and then jump if the comparison yields zero instead of equal. It makes some sense again when you remember that CMP is a SUB that does not save the result in the destination, though :)

About the jump table: it is not so good when the runs of consecutive values are too few; you have to combine it with a binary-tree approach.

I'm not so sure about what f0dder says about cache thrashing. After all, code wastes cache too, and micro-architecture jokes like the Pentium 4's pollute the trace cache much more with branching, because it tends to store the same decoded instruction more than once.

r22 01 Jul 2007, 05:03
WM_ values go from 0 to 1023 (not including user-defined ones, which start at WM_USER = 0x0400).
Each entry in the jump table would take up 4 bytes on a 32-bit system.

1024 * 4 = 4096 bytes, or 4 KB.
A good size for a LUT.

I would probably only implement a jump table for the wndproc if I had ~30 or more messages that I wanted to handle. Any fewer than that and it would just be overkill IMHO.

But it's an interesting subject. Most of us just assume, from experience and from how the windowing system works, that optimizing the wndproc would be pointless, BUT I don't think we've ever actually set up a benchmark to know once and for all.

Maybe in an OpenGL program you'd want an optimal wndproc for key handling... odd that the fasm ddraw example works but the opengl example doesn't work on my Win XP64 box.

f0dder 01 Jul 2007, 09:44
LocoDelAssembly: IMHO "JE" is more logical than "JZ" in this particular case, since you're (logically) checking for equality, not zero.

Quote:

I'm not so sure about what f0dder says about cache thrashing. After all, code wastes cache too, and micro-architecture jokes like the Pentium 4's pollute the trace cache much more with branching, because it tends to store the same decoded instruction more than once.

Well, you waste less cache with the usually rather small number of messages you have to handle. The CMP/JE sequence takes at most 11 bytes per message (7 bytes for those whose target is within short-jump range, -128 to +127 bytes), meaning that you'd need >370 messages to take up the same space as a jump table :)
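The arithmetic behind that estimate, as a small Python sketch. The table size comes from r22's post; reading the 11 bytes as a 5-byte `cmp eax,imm32` plus a 6-byte near `je`, and the 7 bytes as the same compare plus a 2-byte short `je`, is an assumption about how the figures were derived. Handler code and the dispatch instructions themselves are ignored.

```python
TABLE_ENTRIES = 1024   # WM_ values 0..1023, as stated in the thread
ENTRY_BYTES = 4        # one 32-bit address per entry
CMP_JE_NEAR = 11       # assumed: 5-byte cmp eax,imm32 + 6-byte near je
CMP_JE_SHORT = 7       # assumed: same cmp + 2-byte short je

def chain_bytes(n_messages, bytes_per_test=CMP_JE_NEAR):
    """Code size of a CMP/JE chain that tests n_messages values."""
    return n_messages * bytes_per_test

def break_even(bytes_per_test=CMP_JE_NEAR):
    """Number of tested messages at which the chain is as big as the table."""
    return TABLE_ENTRIES * ENTRY_BYTES / bytes_per_test

if __name__ == "__main__":
    print("table:", TABLE_ENTRIES * ENTRY_BYTES, "bytes")
    print("30 messages, worst case:", chain_bytes(30), "bytes")
    print("break-even (near je):", round(break_even(), 1))
    print("break-even (short je):", round(break_even(CMP_JE_SHORT), 1))
```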

Hadn't thought of the trace cache, that could be an issue as well (but how much, really? How much effect does it have, considering the amount of API code etc. your wndproc goes through?)

r22 wrote:

But it's an interesting subject. Most of us just assume, from experience and from how the windowing system works, that optimizing the wndproc would be pointless, BUT I don't think we've ever actually set up a benchmark to know once and for all.

It's going to be pretty damn hard to do any timings, even if you find a really old box. I'm not sure how I'd even go about setting up the timing...

r22 wrote:

Maybe in an OpenGL program you'd want an optimal wndproc for key handling... odd that the fasm ddraw example works but the opengl example doesn't work on my Win XP64 box.

Don't really think so - even on a 100 MHz 486, key input is infinitely slow... slow in the sense that you're dealing with a pathetic human being with all our mechanical limitations :P

The only thing I can really think of where wndproc handling could matter (i.e., with enough messages coming in at a fast rate) would be WM_* async socket stuff... but that sucks big time, and its bottleneck lies in the whole message system rather than in how you code your wndproc anyway.

Not saying that jump tables can't be good, though - they can be really nifty when dealing with byte range input, for instance.

LocoDelAssembly 01 Jul 2007, 15:26
Quote:

LocoDelAssembly: IMHO "JE" is more logical than "JZ" in this particular case, since you're (logically) checking for equality, not zero.

And I agree. What I meant about JZ is that it is not an intentional obfuscation, because CMP is a subtraction, and when both values are equal the result is zero.
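A tiny Python model of that point, as a sketch rather than a CPU emulator: `cmp` computes the 32-bit difference, throws it away and keeps only the flags, and JE/JZ (two names for one instruction, opcode 0x74 in the short form) jump when the zero flag is set. The function names are invented for illustration.

```python
MASK = 0xFFFFFFFF  # 32-bit registers

def cmp_sets_zf(a, b):
    """Model of 'cmp a, b': compute a - b, discard it, report the zero flag."""
    return ((a - b) & MASK) == 0

WM_DESTROY = 0x0002

def je_taken(msg, constant):
    # JE and JZ perform the same test: jump if ZF is set.
    return cmp_sets_zf(msg, constant)

if __name__ == "__main__":
    print(je_taken(WM_DESTROY, WM_DESTROY))  # equal values: difference is zero
    print(je_taken(0x0100, WM_DESTROY))      # different values: nonzero difference
```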

About the trace cache: for some reason I can't find the long, good explanation in Agner Fog's manual :?

Still, it says this:
Agner Fog - Optimizing Assembly wrote:
Microprocessors with a trace cache are likely to store multiple instances of the same code in the trace cache when the code contains many jumps.


About the 4 KB table: if it is used, then f0dder is right about cache thrashing (unless very few cache lines get accessed).

f0dder 01 Jul 2007, 16:30
LocoDelAssembly wrote:

About the 4 KB table: if it is used, then f0dder is right about cache thrashing (unless very few cache lines get accessed).

You'd have to look at the distribution of WM_* messages handled by your program to answer that one :)
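One way to look at that distribution, sketched in Python: given the message numbers a program actually dispatches, count how many distinct cache lines of the 4 KB table their entries fall into. The 64-byte line size, the assumption that the table starts on a line boundary, and the sample message set are all illustrative guesses, not measurements.

```python
CACHE_LINE = 64    # bytes; an assumption, the line size varies by CPU
ENTRY_BYTES = 4    # one 32-bit address per table entry
TABLE_ENTRIES = 1024

def lines_touched(messages):
    """Distinct cache lines of a line-aligned jump table hit by these messages."""
    return {m * ENTRY_BYTES // CACHE_LINE for m in messages}

def total_lines():
    return TABLE_ENTRIES * ENTRY_BYTES // CACHE_LINE

# Hypothetical set of handled messages (real Win32 values):
# WM_DESTROY, WM_SIZE, WM_PAINT, WM_CLOSE, WM_KEYDOWN, WM_MOUSEMOVE
handled = [0x0002, 0x0005, 0x000F, 0x0010, 0x0100, 0x0200]

if __name__ == "__main__":
    touched = lines_touched(handled)
    print(sorted(touched), "of", total_lines(), "lines")
```

With this sample set only a handful of the table's lines are ever read, which is the "very few cache lines" case mentioned above; a program whose messages are spread more widely would touch more.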