Message board for the users of flat assembler.
Graphics optimization?
First one: what is the fastest way of doing 32-bit transparent blitting?
Specifically, with the RGBA color format (0xaabbggrr): if the source pixel color == colorkey then leave the target pixel, else put the source pixel.
Maybe something based on MMX/SSE?
Best compiler can do:
    mov eax,length
    mov ecx,dst
    mov ebx,src
lp:
    mov edx,[ebx]
    mov esi,edx
    xor esi,key
    jz lp1
    mov [ecx],edx
lp1:
    add ebx,4
    add ecx,4
    dec eax
    jnz lp
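For checking any faster version against the loop above, a plain C reference can help. This is only a sketch: the function name and signature are made up for illustration, and it assumes the count is given in pixels (dwords). Unlike the assembly loop, it also tolerates a count of zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference color-key blit: copy src[i] to dst[i] unless src[i] == key.
   Hypothetical helper, not taken from the original post. */
void blit_colorkey_ref(uint32_t *dst, const uint32_t *src,
                       uint32_t key, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t px = src[i];   /* mov edx,[ebx]            */
        if (px != key)          /* xor esi,key / jz lp1     */
            dst[i] = px;        /* mov [ecx],edx            */
    }
}
```

Running an optimized routine and this one on the same input and comparing the destination buffers is a cheap way to catch mask or offset mistakes.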
Second one: what is the fastest way of filling a buffer of dwords with zeros?
Is anything faster than using the write half of an MMX memory-copy procedure?
Third one: if this is not the proper place to ask this kind of question, where is?
Last edited by Artlav on 13 Sep 2007, 20:36; edited 1 time in total
|13 Sep 2007, 15:40||
re: Second one
If the buffer is 16-byte aligned, MOVDQA dqword[Buffer+Offset],xmmreg is the fastest way.
Zero the XMM register first, e.g. with PXOR xmm0,xmm0.
Unroll the loop to clear 64 bytes at a time; that's usually the optimal stride for SSE.
As for your first question, SSE would speed your task up. If no one else helps you out with some example code, I'll try to get to it later tonight (at work right now).
|13 Sep 2007, 18:47||
Looking at the blit snippet you provided, I realized moving to SSE would be a little difficult because of the dependencies, but here's a parallelized version of your snippet in SSE.
Because you can't easily skip over the destination in SSE, we have to load it, so it adds an extra read: READ(src) READ(dst) WRITE(dst), as opposed to the general-purpose READ(src) IF? WRITE(dst).
TransparentBlit:
    ;; esp+4  = dst (ASSUME 16-byte aligned)
    ;; esp+8  = src (ASSUME 16-byte aligned)
    ;; esp+12 = key
    ;; esp+16 = length
    ;; ASSUME length is in DWORDs and is a nonzero multiple of 8
    PUSH    ebx
    MOV     ecx,[esp+16+4]              ;; length
    MOV     ebx,[esp+12+4]              ;; key
    MOV     eax,[esp+8+4]               ;; src
    MOV     edx,[esp+4+4]               ;; dst
    MOVD    xmm3,ebx
    SUB     ecx,8
    PSHUFD  xmm3,xmm3,0                 ;; copy key|key|key|key
.lp:
    MOVDQA  xmm0,dqword[eax+ecx*4]      ;; src pixels 0..3
    MOVDQA  xmm5,dqword[eax+ecx*4+16]   ;; src pixels 4..7
    MOVDQA  xmm2,xmm0                   ;; keep a copy of src
    MOVDQA  xmm6,xmm5
    PCMPEQD xmm0,xmm3                   ;; mask = (src == key)
    PCMPEQD xmm5,xmm3
    MOVDQA  xmm1,xmm0                   ;; copy of mask
    MOVDQA  xmm7,xmm5
    PAND    xmm0,dqword[edx+ecx*4]      ;; dst where src == key
    PAND    xmm5,dqword[edx+ecx*4+16]
    PANDN   xmm1,xmm2                   ;; src where src != key
    PANDN   xmm7,xmm6
    POR     xmm0,xmm1
    POR     xmm5,xmm7
    MOVDQA  dqword[edx+ecx*4],xmm0      ;; write merged pixels to dst
    MOVDQA  dqword[edx+ecx*4+16],xmm5
    SUB     ecx,8
    JNS     .lp
    POP     ebx
    RET     16
I haven't tested this or compiled it (written in reply text box) but hopefully you can learn from it.
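The read-merge-write idea behind an SSE blit can be sketched in scalar C, one pixel at a time. This is an illustration under assumptions, not the poster's code: the function name is hypothetical, and the mask is built directly from the source-versus-key comparison, so a destination pixel that happens to be zero needs no special handling.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless color-key blit: every destination pixel is read, merged
   and written back. Hypothetical helper that mirrors a compare, AND,
   AND-NOT, OR sequence on one 32-bit lane. */
void blit_colorkey_mask(uint32_t *dst, const uint32_t *src,
                        uint32_t key, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t s = src[i];
        /* All ones where the source pixel equals the key, else zero. */
        uint32_t mask = (uint32_t)0 - (uint32_t)(s == key);
        /* Keep dst where the mask is set, take src elsewhere. */
        dst[i] = (dst[i] & mask) | (s & ~mask);
    }
}
```

The cost is the one described above: the destination is always read and always written, even for pixels that end up unchanged.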
ZeroBuffer:
    ;; esp+4 = Buffer Address (ASSUMING buffer address is 16-byte aligned)
    ;; esp+8 = Length in bytes (ASSUMING length is a nonzero multiple of 64)
    PXOR    xmm0,xmm0
    MOV     ecx,[esp+8]
    MOV     eax,[esp+4]
    SUB     ecx,64
.lp:
    MOVDQA  [eax+ecx+48],xmm0
    MOVDQA  [eax+ecx+32],xmm0
    MOVDQA  [eax+ecx+16],xmm0
    MOVDQA  [eax+ecx],xmm0
    SUB     ecx,64
    JNS     .lp
    RET     8
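The loop shape of ZeroBuffer (start at the last 64-byte block, step down, stop when the offset goes negative) can be modelled in C to check the bounds. This is a sketch only: the function name is made up, each 16-byte memset merely stands in for one MOVDQA store, and it says nothing about speed. It assumes, like the routine above, that the length is a nonzero multiple of 64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Zero len bytes at buf, 64 bytes per iteration, walking from the end
   toward the start like a SUB ecx,64 / JNS loop.
   ASSUMPTION: len is a nonzero multiple of 64. */
void zero_buffer_64(unsigned char *buf, size_t len)
{
    ptrdiff_t off = (ptrdiff_t)len - 64;
    do {
        memset(buf + off + 48, 0, 16);
        memset(buf + off + 32, 0, 16);
        memset(buf + off + 16, 0, 16);
        memset(buf + off,      0, 16);
        off -= 64;
    } while (off >= 0);   /* JNS: continue while the offset is not negative */
}
```

Because the body runs before the test, a length of zero would write before the start of the buffer, which is why the nonzero assumption matters.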
|14 Sep 2007, 01:46||
Here you go: short, sweet, and fast.
--------------------how it would look with stosw
There is one with movsw that transparent-blits a 1 MB BMP in a matter of 10 ticks; I forgot it and am trying to remember it.
You may want to push and pop to save your register values, otherwise your program is going to crash.
Last edited by xspeed on 17 Sep 2007, 18:01; edited 1 time in total
|14 Sep 2007, 16:23||
Here is a fast pixel proc for mode 13h; maybe someone could modify it to work with other modes too.
    ; very fast pixel proc for mode 13h
    ; btw, #A0000 / 8 = #14000
    mov ebx, dword x
    mov edx, dword y
    mov cl , byte color
    lea edx, [edx + edx*4]          ; edx = y*5
    lea edx, [edx*8 + 14000H]       ; edx = y*40 + 14000H
    mov [ebx + edx*8], cl           ; address = x + y*320 + 0A0000H
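The address arithmetic can be checked in C. The sketch below uses two hypothetical helper names and assumes the usual mode 13h layout: 320 bytes per row with the framebuffer at linear address 0xA0000. It compares the LEA-style computation with the textbook formula.

```c
#include <stdint.h>

/* Address as built by the two LEAs plus the scaled store. */
uint32_t mode13_addr_lea(uint32_t x, uint32_t y)
{
    uint32_t edx = y + y * 4;       /* lea edx,[edx+edx*4]    -> y*5            */
    edx = edx * 8 + 0x14000u;       /* lea edx,[edx*8+14000H] -> y*40 + 0x14000 */
    return x + edx * 8;             /* [ebx+edx*8] -> x + y*320 + 0xA0000       */
}

/* Textbook form: base + y * pitch + x. */
uint32_t mode13_addr_plain(uint32_t x, uint32_t y)
{
    return 0xA0000u + y * 320u + x;
}
```

The trick relies on 320 = 5 * 8 * 8 and on the base being divisible by 8, so a mode with a different row pitch would need a different factorization or a plain multiply.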
New User.. Hayden McKay.
|16 Sep 2007, 11:09||