flat assembler
Message board for the users of flat assembler.

Memory Block calculations

sq4² 31 Jul 2005, 23:03
First of all: I am new to fasm.

I want to optimize a process in a realtime thread (written in PureBasic).

The process takes a pointer to a memory block of variable length (between 1 and 64k).
This memory block is filled with 32-bit floats.
Every float has to be multiplied by a constant value 2^x, where 0 < x < 32.

Does anybody have an idea what the fastest routine could be?
At the moment I do it in a For/Next loop with Peek and Poke.
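
A side note on the factor: every 2^x with 0 < x < 32 is exactly representable as a 32-bit float, so for normal values the multiplication only changes the exponent field and loses no precision (unless the result overflows). A minimal sketch of how such a constant could be defined in fasm follows. The names X, constant and constant2 and the choice x = 15 are assumptions made for illustration only:
Code:
X = 15                          ; assumed exponent, 0 < X < 32
constant  dd 32768.0            ; 2^15 written as a float literal
constant2 dd (127 + X) shl 23   ; same bit pattern (47000000h), built from the biased exponent
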
comrade 01 Aug 2005, 03:40
SSE
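
For context: SSE instructions work on four packed 32-bit floats at a time in the xmm registers, so a single mulps performs four of the multiplications. Whether the CPU supports SSE can be checked with CPUID. A minimal sketch, assuming 32-bit code and a hypothetical label .no_sse for a scalar fallback path:
Code:
mov eax,1
cpuid               ; feature flags come back in edx (eax, ebx, ecx, edx are overwritten)
bt edx,25           ; bit 25 = SSE (bit 26 would be SSE2)
jnc .no_sse         ; hypothetical label: use the plain loop instead
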
sq4² 01 Aug 2005, 09:51
Can you point me in the right direction?
(SSE related)

Pseudo code:
Code:
  ptrSrc = start of source memory block
  ptrDst = start of destination memory block
  Constant = (2^15)
  For i = 0 To blocksize-1            ; blocksize = number of floats
    Float = PeekFloat(ptrSrc)
    PokeLong(ptrDst, Float*Constant)  ; result is stored as a 32-bit integer
    ptrSrc + 4 : ptrDst + 4           ; advance both pointers by one 4-byte element
  Next
    

Thanks.
MCD 01 Aug 2005, 13:22
It's not completely clear to me whether these are meant to be constants or variables in your pseudo-code:
ptrSrc
ptrDst
Constant
blocksize

If they are variables, then this is perhaps what comrade meant:
Code:
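mov esi,[ptrSrc]
mov edi,[ptrDst]
mov ecx,[blocksize]      ; length in bytes here (4 * number of floats)
add esi,ecx              ; point both registers at the end of their blocks
add edi,ecx
neg ecx                  ; ecx = -length, counts up towards zero
movss xmm1,[constant]
unpcklps xmm1,xmm1
movlhps xmm1,xmm1        ; xmm1 = constant in all four lanes

.MulLoop:
movups xmm0,[esi+ecx]
mulps xmm0,xmm1
movups [edi+ecx],xmm0
add ecx,16
jnc .MulLoop             ; carry is set when ecx wraps around to zero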
    

Note: the length of both Src and Dst must be a multiple of 16 bytes. If both are also aligned on a 16-byte boundary, you can speed it up by using movaps, which requires that alignment:
Code:
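mov esi,[ptrSrc]
mov edi,[ptrDst]
mov ecx,[blocksize]      ; length in bytes, a multiple of 16
add esi,ecx              ; point both registers at the end of their blocks
add edi,ecx
neg ecx                  ; ecx = -length, counts up towards zero
movss xmm1,[constant]
unpcklps xmm1,xmm1
movlhps xmm1,xmm1        ; xmm1 = constant in all four lanes

.MulLoop:
movaps xmm0,[esi+ecx]
mulps xmm0,xmm1
movaps [edi+ecx],xmm0 ; 1.)
add ecx,16
jnc .MulLoop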
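
Both loops assume the length is a multiple of 16 bytes. If that cannot be guaranteed, one option is to run the packed loop over the largest multiple of 16 and finish the remaining one to three floats with scalar instructions. A sketch, assuming blocksize is in bytes and a multiple of 4, and using made-up label names:
Code:
mov esi,[ptrSrc]
mov edi,[ptrDst]
mov ecx,[blocksize]
mov edx,ecx
and ecx,-16             ; bytes that can be handled four floats at a time
and edx,15              ; leftover bytes (0, 4, 8 or 12)
movss xmm1,[constant]
unpcklps xmm1,xmm1
movlhps xmm1,xmm1       ; constant in all four lanes
add esi,ecx
add edi,ecx
neg ecx
jz .Tail                ; fewer than four floats in total

.PackedLoop:
movups xmm0,[esi+ecx]
mulps xmm0,xmm1
movups [edi+ecx],xmm0
add ecx,16
jnc .PackedLoop

.Tail:
test edx,edx
jz .Done

.TailLoop:
movss xmm0,[esi]        ; esi/edi point just past the packed part
mulss xmm0,xmm1         ; scalar multiply, uses the low lane of xmm1
movss [edi],xmm0
add esi,4
add edi,4
sub edx,4
jnz .TailLoop

.Done: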
    

If you know that the results won't be read again anytime soon, you can also replace the line marked 1.) with "movntps [edi+ecx],xmm0".
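
A sketch of that non-temporal variant, for reference. movntps writes around the cache, which tends to pay off only when the destination is large and not read back soon, so for blocks of at most 64k it is worth measuring rather than assuming a gain. Non-temporal stores are weakly ordered, so an sfence after the loop is commonly added before another thread reads the data. The length is assumed to be in bytes and a multiple of 16, with both blocks 16-byte aligned:
Code:
mov esi,[ptrSrc]
mov edi,[ptrDst]
mov ecx,[blocksize]
add esi,ecx
add edi,ecx
neg ecx
movss xmm1,[constant]
unpcklps xmm1,xmm1
movlhps xmm1,xmm1

.MulLoop:
movaps xmm0,[esi+ecx]
mulps xmm0,xmm1
movntps [edi+ecx],xmm0  ; non-temporal store, needs 16-byte alignment
add ecx,16
jnc .MulLoop
sfence                  ; make the stores visible before anyone else reads them
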
If that isn't fast enough, you can unroll the loop further and do several multiplications per iteration in different registers, like this, but then the block length must be a multiple of 64 bytes (16-byte alignment is still sufficient for movaps):
Code:
mov esi,[ptrSrc]
mov edi,[ptrDst]
mov ecx,[blocksize]      ; length in bytes, a multiple of 64
add esi,ecx              ; point both registers at the end of their blocks
add edi,ecx
neg ecx                  ; ecx = -length, counts up towards zero
movss xmm7,[constant]
unpcklps xmm7,xmm7
movlhps xmm7,xmm7        ; xmm7 = constant in all four lanes

.MulLoop:
movaps xmm0,[esi+ecx]
mulps xmm0,xmm7
movaps xmm1,[esi+ecx+16]
mulps xmm1,xmm7
movaps xmm2,[esi+ecx+32]
mulps xmm2,xmm7
movaps xmm3,[esi+ecx+48]
mulps xmm3,xmm7
movaps [edi+ecx],xmm0 ;1.)
movaps [edi+ecx+16],xmm1 ;1.)
movaps [edi+ecx+32],xmm2 ;1.)
movaps [edi+ecx+48],xmm3 ;1.)
add ecx,64
jnc .MulLoop
    

It all just depends on how much data you have to process.
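
One thing these loops do not do is the conversion implied by PokeLong in the pseudo code: they store the scaled values as floats. If 32-bit integers are wanted in the destination, SSE2 offers cvtps2dq, which converts four floats to four signed dwords using the current rounding mode (round to nearest by default), while cvttps2dq truncates instead. Values that do not fit in a signed dword come out as 80000000h. A sketch, assuming SSE2 is available, the length is in bytes and a multiple of 16, and both blocks are 16-byte aligned:
Code:
mov esi,[ptrSrc]
mov edi,[ptrDst]
mov ecx,[blocksize]
add esi,ecx
add edi,ecx
neg ecx
movss xmm1,[constant]
unpcklps xmm1,xmm1
movlhps xmm1,xmm1

.CvtLoop:
movaps xmm0,[esi+ecx]
mulps xmm0,xmm1
cvtps2dq xmm0,xmm0      ; four floats -> four signed 32-bit integers
movdqa [edi+ecx],xmm0
add ecx,16
jnc .CvtLoop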

sq4² 02 Aug 2005, 23:55
Thanks a lot.
I'll try it and let you know.
