flat assembler
Message board for the users of flat assembler.

Graphics optimization?

Artlav 13 Sep 2007, 15:40
First question: what is the fastest way to do 32-bit transparent blitting?

Specifically, the color format is RGBA (0xaabbggrr); if the source pixel color == colorkey then leave the target pixel untouched, else write the source pixel.

Maybe something based on MMX/SSE?

The best the compiler can do:
Code:
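mov eax,length   ; number of pixels
mov ecx,dst
mov ebx,src
lp:
mov edx,[ebx]    ; load source pixel
mov esi,edx
xor esi,key      ; zero if source pixel == colorkey
jz lp1           ; transparent: leave target pixel alone
mov [ecx],edx    ; opaque: write source pixel
lp1:
add ebx,4
add ecx,4
dec eax
jnz lp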
    


Second question: what is the fastest way to fill a buffer of dwords with zeros?
Is anything faster than using just the write half of an MMX memory copy procedure?


Third question: if this is not the proper place to ask this kind of question, where is?

Thank you.


Last edited by Artlav on 13 Sep 2007, 20:36; edited 1 time in total
r22 13 Sep 2007, 18:47
Re: second question

If the buffer is 16-byte aligned, MOVDQA dqword[Buffer+Offset],xmmreg is the fastest way.

Zero the XMM register first with, e.g., PXOR xmm0,xmm0.

Unroll the loop to clear 64 bytes per iteration; that's usually the optimal stride for SSE.

As for your first question, SSE would speed your task up. If no one else helps you out with some example code, I'll try to get to it later tonight (at work right now).
r22 14 Sep 2007, 01:46
Looking at the blit snippet you provided, I realized that moving to SSE would be a little difficult because of the dependencies, but here's a parallelized version of your snippet in SSE.
Because you can't easily skip over destination pixels in SSE, we have to load the destination and merge it with the source, so each group of pixels costs READ(src) READ(dst) WRITE(dst), as opposed to the general-purpose READ(src) IF? WRITE(dst).
Code:
TransparentBlit:
;;esp+4 = dst (ASSUME 16 byte aligned)
;;esp+8 = src (ASSUME 16 byte aligned)
;;esp+12 = key
;;esp+16 = length
;;ASSUME length is in DWORDs and is a nonzero multiple of 8
PUSH ebx
MOV ecx,[esp+16+4] ;;length
MOV ebx,[esp+12+4] ;;key
MOV eax,[esp+8+4] ;;src
MOV edx,[esp+4+4] ;;dst
MOVD xmm3,ebx
SUB ecx,8
PSHUFD xmm3,xmm3,0 ;;copy key|key|key|key
.lp:
MOVDQA xmm0,dqword[eax+ecx*4] ;;src pixels 0..3
MOVDQA xmm5,dqword[eax+ecx*4+16] ;;src pixels 4..7
MOVDQA xmm1,xmm0
MOVDQA xmm6,xmm5
PCMPEQD xmm1,xmm3 ;;mask = all ones where src == key
PCMPEQD xmm6,xmm3
MOVDQA xmm2,xmm1
MOVDQA xmm7,xmm6
PAND xmm2,dqword[edx+ecx*4] ;;dst where src == key
PAND xmm7,dqword[edx+ecx*4+16]
PANDN xmm1,xmm0 ;;src where src != key
PANDN xmm6,xmm5
POR xmm1,xmm2 ;;merged pixels
POR xmm6,xmm7
MOVDQA dqword[edx+ecx*4],xmm1
MOVDQA dqword[edx+ecx*4+16],xmm6
SUB ecx,8
JNS .lp
POP ebx
RET 16
    

I haven't tested this or compiled it (written in reply text box) but hopefully you can learn from it.
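
Since the SSE routine is untested, a scalar reference can help check it. The following is only a sketch in C, not part of the original post: the function names are hypothetical, and it assumes the color-key rule from the first post (skip pixels whose source value equals the key). The second function spells out, one pixel at a time, the mask-and-merge select that an SSE2 version performs on four pixels at once (PCMPEQD to build the mask, then PAND/PANDN/POR to merge).

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference: copy src to dst except where src equals the color key. */
void blit_colorkey_ref(uint32_t *dst, const uint32_t *src,
                       uint32_t key, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (src[i] != key)
            dst[i] = src[i];
}

/* Branchless form: the same select a SIMD version does per 32-bit lane. */
void blit_colorkey_mask(uint32_t *dst, const uint32_t *src,
                        uint32_t key, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* All ones where src == key, all zeros otherwise (what PCMPEQD yields). */
        uint32_t mask = (src[i] == key) ? 0xFFFFFFFFu : 0u;
        /* Keep dst under the mask, take src elsewhere, then merge. */
        dst[i] = (dst[i] & mask) | (src[i] & ~mask);
    }
}
```

Running both functions on the same input, including destination pixels that are 0 and source pixels equal to the key, gives a quick way to compare any vectorized routine against the plain loop.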

Code:
ZeroBuffer:
;;esp+4 = Buffer Address (ASSUMING buffer address is 16 byte aligned)
;;esp+8 = Length in bytes (ASSUMING length is a nonzero multiple of 64)
PXOR xmm0,xmm0
MOV ecx,[esp+8]
MOV eax,[esp+4]
SUB ecx,64
.lp:
MOVDQA [eax+ecx+48],xmm0
MOVDQA [eax+ecx+32],xmm0
MOVDQA [eax+ecx+16],xmm0
MOVDQA [eax+ecx],xmm0
SUB ecx,64
JNS .lp
RET 8
    
xspeed 14 Sep 2007, 16:23
Here you go: short, sweet, and fast.

How it would look with lodsd/stosd:
Code:
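cld              ; string instructions must count upward
mov ecx,length
mov edi,dest
mov esi,source

b1:
lodsd            ; eax = [esi], esi += 4
cmp eax,key      ; source pixel == colorkey?
je b2
stosd            ; opaque: [edi] = eax, edi += 4
jmp b3
b2:
add edi,4        ; transparent: skip the target pixel
b3:
dec ecx
jnz b1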

There is also one using movsw that transparent-blits a 1 MB bitmap in about 10 ticks; I have forgotten it and am trying to remember it.

You may want to push and pop to save your register values, otherwise your program is going to crash.


Last edited by xspeed on 17 Sep 2007, 18:01; edited 1 time in total
Hayden 16 Sep 2007, 11:09
Here is a fast pixel proc for mode 13h; maybe someone could modify it to work with other modes too.

Code:
; very fast pixel proc for mode 13h
; btw, #A0000 / 8 = #14000

    mov  ebx, dword x
    mov  edx, dword y
    mov  cl , byte color
    lea edx, [edx + edx*4]
    lea edx, [edx*8 + 14000H]
    mov [ebx + edx*8], cl
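
The address arithmetic in this snippet can be checked with a small sketch in C. This is only an illustration, not part of the original post: the function names are hypothetical, and it assumes a flat address space where the mode 13h frame buffer (320x200, one byte per pixel) starts at 0xA0000. The first function mirrors the two LEA steps and the scaled store; the second is the textbook base + y*320 + x form.

```c
#include <stdint.h>

/* Address computed the way the two LEA instructions and the final store do. */
uint32_t mode13_addr_lea(uint32_t x, uint32_t y)
{
    uint32_t t = y + y * 4;   /* lea edx,[edx + edx*4]    -> y*5            */
    t = t * 8 + 0x14000u;     /* lea edx,[edx*8 + 14000H] -> y*40 + 0x14000 */
    return x + t * 8;         /* [ebx + edx*8]   -> x + y*320 + 0xA0000     */
}

/* Textbook form: frame buffer base 0xA0000, 320 bytes per scan line. */
uint32_t mode13_addr_plain(uint32_t x, uint32_t y)
{
    return 0xA0000u + y * 320u + x;
}
```

The trick is that the base is pre-divided by 8 (0xA0000 / 8 = 0x14000), so the final *8 scale in the store restores both the base and the row stride without a multiply instruction.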
    


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
