flat assembler
Message board for the users of flat assembler.
Graphics optimization?
r22 13 Sep 2007, 18:47
re: Second one
If the buffer is 16-byte aligned, MOVDQA dqword[Buffer+Offset],xmmreg is the fastest way. Zero the XMM register first, e.g. PXOR xmm0,xmm0. Unroll the loop to clear 64 bytes at a time; that's usually the optimal stride for SSE. As for your first question, SSE would speed your task up. If no one else helps you out with some example code, I'll try to get to it later tonight (at work right now).
r22 14 Sep 2007, 01:46
Looking at the Blit snippet you provided, I realized moving to SSE would be a little difficult because of the dependencies, but here's a parallelized version of your snippet in SSE.
Because you can't easily skip over the destination in SSE, we have to load it, so each block costs READ(src), READ(dst), WRITE(dst), as opposed to the general-purpose READ(src), IF?, WRITE(dst).
Code:
TransparentBlit:
;;esp+4  = dst (ASSUME 16 byte aligned)
;;esp+8  = src (ASSUME 16 byte aligned)
;;esp+12 = key
;;esp+16 = length
;;ASSUME length is in DWORDs and is a nonzero multiple of 8
PUSH ebx
MOV ecx,[esp+16+4]              ;;length in DWORDs
MOV ebx,[esp+12+4]              ;;key
MOV eax,[esp+8+4]               ;;eax = src
MOV edx,[esp+4+4]               ;;edx = dst
MOVD xmm3,ebx
SUB ecx,8
PSHUFD xmm3,xmm3,0              ;;copy key|key|key|key
.lp:
MOVDQA xmm0,dqword[eax+ecx*4]       ;;src pixels 0..3
MOVDQA xmm5,dqword[eax+ecx*4+16]    ;;src pixels 4..7
MOVDQA xmm2,xmm0                    ;;keep a copy of src
MOVDQA xmm6,xmm5
PCMPEQD xmm0,xmm3                   ;;mask = all ones where src == key
PCMPEQD xmm5,xmm3
MOVDQA xmm1,xmm0                    ;;copy of mask
MOVDQA xmm7,xmm5
PAND xmm0,dqword[edx+ecx*4]         ;;dst where src is transparent
PAND xmm5,dqword[edx+ecx*4+16]
PANDN xmm1,xmm2                     ;;(NOT mask) AND src = src where opaque
PANDN xmm7,xmm6
POR xmm0,xmm1                       ;;merge the two halves
POR xmm5,xmm7
MOVDQA dqword[edx+ecx*4],xmm0
MOVDQA dqword[edx+ecx*4+16],xmm5
SUB ecx,8
JNS .lp
POP ebx
RET 16
I haven't tested this or compiled it (written in the reply text box), but hopefully you can learn from it.
Code:
ZeroBuffer:
;;esp+4 = Buffer Address (ASSUMING buffer address is 16 byte aligned)
;;esp+8 = Length (ASSUMING length is in bytes and is a nonzero multiple of 64)
PXOR xmm0,xmm0
MOV ecx,[esp+8]
MOV eax,[esp+4]
SUB ecx,64
.lp:
MOVDQA [eax+ecx+48],xmm0
MOVDQA [eax+ecx+32],xmm0
MOVDQA [eax+ecx+16],xmm0
MOVDQA [eax+ecx],xmm0
SUB ecx,64
JNS .lp
RET 8
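For anyone who wants to check the SSE routine against something simpler, the branch-free select it relies on can be modeled per pixel in plain C. This is only a sketch: the function names blit_branchy and blit_masked are made up for illustration, pixels are assumed to be 32-bit, and no alignment or SSE is involved. The mask line plays the role of PCMPEQD, and the merge is the usual (dst AND mask) OR (src AND NOT mask).

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: copy src over dst, skipping pixels equal to key. */
static void blit_branchy(uint32_t *dst, const uint32_t *src,
                         uint32_t key, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] != key)
            dst[i] = src[i];
    }
}

/* Branch-free version: build an all-ones/all-zeros mask per pixel
   (what a packed compare produces per lane) and merge with AND/OR. */
static void blit_masked(uint32_t *dst, const uint32_t *src,
                        uint32_t key, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = (src[i] == key) ? 0xFFFFFFFFu : 0u;
        dst[i] = (dst[i] & mask) | (src[i] & ~mask);
    }
}
```

Both functions should leave dst identical for any input, including the case where a destination pixel is 0 underneath a transparent source pixel, which is the case worth testing first in any masked blit.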
xspeed 14 Sep 2007, 16:23
Here you go: short, sweet, and fast.
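How it would look with stosd:
; length, dest, source and key stand for your own values
; assumes the direction flag is clear (DF = 0)
mov ecx,length          ; number of pixels (dwords)
mov edi,dest
mov esi,source
b1: lodsd               ; eax = next source pixel
mov edx,eax
xor edx,key             ; zero if the pixel equals the key
jz b2
stosd                   ; opaque: write it and advance edi
jmp b3
b2: add edi,4           ; transparent: skip the destination pixel
b3: dec ecx
jnz b1
There is also one with movsw that transparent-blits a 1 MB bitmap in a matter of 10 ticks; I forgot it and am trying to remember it. You may want to push and pop the registers to save their values, otherwise your program is going to crash.
Last edited by xspeed on 17 Sep 2007, 18:01; edited 1 time in total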
Hayden 16 Sep 2007, 11:09
Here is a fast pixel proc for mode 13h; maybe someone could modify it to work with other modes too.
Code:
; very fast pixel proc for mode 13h
; btw, #A0000 / 8 = #14000
mov ebx, dword [x]          ; x, y and color are memory variables
mov edx, dword [y]
mov cl , byte [color]
lea edx, [edx + edx*4]      ; edx = y*5
lea edx, [edx*8 + 14000H]   ; edx = y*40 + A0000h/8
mov [ebx + edx*8], cl       ; address = x + y*320 + A0000h
_________________
New User.. Hayden McKay.
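The address arithmetic in that snippet is easy to verify on paper or in a few lines of C. The sketch below only checks the numbers and never touches video memory; the function names pixel_addr_plain and pixel_addr_lea are hypothetical, and the 320x200 size is the standard mode 13h resolution.

```c
#include <stdint.h>

#define MODE13_BASE  0xA0000u   /* linear address of the mode 13h framebuffer */
#define MODE13_WIDTH 320u       /* bytes per scanline, one byte per pixel */

/* Straightforward form: base + y*320 + x. */
static uint32_t pixel_addr_plain(uint32_t x, uint32_t y)
{
    return MODE13_BASE + y * MODE13_WIDTH + x;
}

/* Mirrors the two LEA steps and the scaled index of the final store. */
static uint32_t pixel_addr_lea(uint32_t x, uint32_t y)
{
    uint32_t t = y + y * 4;     /* lea edx,[edx+edx*4]    -> y*5 */
    t = t * 8 + 0x14000u;       /* lea edx,[edx*8+14000H] -> y*40 + A0000h/8 */
    return x + t * 8;           /* [ebx+edx*8]            -> x + y*320 + A0000h */
}
```

The trick works because 320 = 5 * 8 * 8 and the base A0000h is divisible by 8, so the base can be folded into the second LEA and scaled back up by the final *8.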
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.