flat assembler
Message board for the users of flat assembler.

Is there an existent vector library for fasm?

adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo
Has anyone written a vector library for/in fasm? Especially vec2, etc.

I need this, and since it's quite a trivial piece of code I was wondering if someone else has already written it. I'm also interested in learning how such a library can be optimized for speed.

Such a library usually includes: set, add, sub, mul, dot, etc. (also neg, mag, unit, norm, project).

I could probably write it in a few hours (note that I'm quite the newbie in assembly), but then again, if it exists, why reinvent it? I want to focus on the game itself rather than the core elements at the moment.
Post 14 Nov 2008, 07:37
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid
Remember that you can use C libraries from assembly, and such optimized vector libraries would be written in assembly anyway. It's just not likely that an existing library would include headers for fasm (or any other assembly syntax), so you would need to make those yourself, but that should be an easy task.
Post 14 Nov 2008, 07:51
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo
vid, thanks for the pointer. Which optimized vector library do you recommend I take a look at? (Any on the open source side without many license limitations?)
Post 14 Nov 2008, 07:53
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid
I really can't help you with this; I never did much vector stuff. If you are REALLY after optimization, it is probably better to write the time-critical operations yourself (using MMX or SSE).

Otherwise I really don't know, but you could take a look at libtommath. It has a really nice low-level interface that can be used from asm, and I think it has vector functions too (not sure).
Post 14 Nov 2008, 09:11
bitshifter



Joined: 04 Dec 2007
Posts: 764
Location: Massachusetts, USA
bitshifter
There are a few libraries around, but I have not seen one for fasm.
I am in the process of writing a 3D vector library for fasm.
If you need help writing a 2D version, I can assist.
Here are a few tricks I'm using in my library to make it easy
for someone to choose either 32-bit or 64-bit precision vectors.
Each routine also notes how much pressure it puts on the FPU stack and whether the stack is restored to its original state.
(In some cases it is desirable to return a value on the FPU stack.)
Code:
;------------------------------------------------
; 2d vector math library
;------------------------------------------------

; for 32 bit precision
VEC2_VALTYPE fix dd
VEC2_VALSIZE fix dword
VEC2_OFFSETY fix 4

; for 64 bit precision
;VEC2_VALTYPE fix dq
;VEC2_VALSIZE fix qword
;VEC2_OFFSETY fix 8

struct Vec2
   x VEC2_VALTYPE ?
   y VEC2_VALTYPE ?
ends

;------------------------------------------------
; Vec2 = Vec2 + Vec2
;
; internal fpu stack pressure = 1
; external fpu stack pressure = 0
; restores fpu stack state = true
;------------------------------------------------

macro Vec2_SumVec2 res,opa,opb
{
   fld  VEC2_VALSIZE[opa]
   fadd VEC2_VALSIZE[opb]
   fstp VEC2_VALSIZE[res]
   fld  VEC2_VALSIZE[opa+VEC2_OFFSETY]
   fadd VEC2_VALSIZE[opb+VEC2_OFFSETY]
   fstp VEC2_VALSIZE[res+VEC2_OFFSETY]
}

;------------------------------------------------
; Vec2 += Vec2
;
; internal fpu stack pressure = 1
; external fpu stack pressure = 0
; restores fpu stack state = true
;------------------------------------------------

macro Vec2_AddVec2 dst,src
{
   fld  VEC2_VALSIZE[src]
   fadd VEC2_VALSIZE[dst]
   fstp VEC2_VALSIZE[dst]
   fld  VEC2_VALSIZE[src+VEC2_OFFSETY]
   fadd VEC2_VALSIZE[dst+VEC2_OFFSETY]
   fstp VEC2_VALSIZE[dst+VEC2_OFFSETY]
}
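
A dot product in the same macro style might look like the sketch below. The name Vec2_DotVec2 is hypothetical (it is not taken from the library above), and the macro relies on VEC2_VALSIZE and VEC2_OFFSETY being defined as shown earlier. It leaves the result in st0, which is the case mentioned above where returning a value on the FPU stack is desirable.

```asm
;------------------------------------------------
; st0 = Vec2 . Vec2   (sketch, hypothetical name)
;
; relies on VEC2_VALSIZE / VEC2_OFFSETY defined above
;
; internal fpu stack pressure = 2
; external fpu stack pressure = 1 (result left in st0)
; restores fpu stack state = false
;------------------------------------------------

macro Vec2_DotVec2 opa,opb
{
   fld   VEC2_VALSIZE[opa]                ; st0 = a.x
   fmul  VEC2_VALSIZE[opb]                ; st0 = a.x*b.x
   fld   VEC2_VALSIZE[opa+VEC2_OFFSETY]   ; st0 = a.y, st1 = a.x*b.x
   fmul  VEC2_VALSIZE[opb+VEC2_OFFSETY]   ; st0 = a.y*b.y
   faddp st1,st0                          ; st0 = a.x*b.x + a.y*b.y
}
```

The caller is expected to consume st0 afterwards, for example with an fstp to its own result variable.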

    
Post 14 Nov 2008, 13:50
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo
Hmm, this is strange. I began working on the vector library and thought about making the structure a vec4 so I could later use SIMD to speed things up, but it seems to be slower in most cases where only a few operations are performed on the structure fields. Is this right?

For instance, I use movaps to load vectors 'a' and 'b' into xmm0 and xmm1, then do addps into xmm2 and movaps the result back to vector 'a' (that's where I wanted the result). Doing the same with the FPU (in blocks of fld, fld, faddp, fstp for each field) gives better results.

I ran a small benchmark and got the following results: 990 ms vs 720 ms. The FPU wins in this case. Is this valid? When or how should I use SIMD in this case?
Post 15 Nov 2008, 18:42
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:

For instance, I use movaps to load vectors 'a' and 'b' into xmm0 and xmm1, then do addps into xmm2 and movaps the result back to vector 'a' (that's where I wanted the result).

Could you post the code of this? Because I don't see how xmm2 is needed here.
Post 15 Nov 2008, 20:12
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo
I tried using just one movups and then addps with the address of the second vector, but that crashes my application; that's why I had to use two movups and then add the registers.

For what it's worth, this laptop only has SSE extensions; I don't know if it makes any difference.

I also tried moving the pointer in EAX to XMM0 and couldn't. This has to be possible, but for some reason it crashes.

Code:
movups xmm0, dqword[v_veca]
movups xmm1, dqword[v_vecb]
addps xmm0, xmm1
movups dqword[v_veca], xmm0    


I mentioned two addps or the use of xmm2 before; my bad. (I've been goofing around with the code way too much.)
Post 15 Nov 2008, 21:39
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
You could do this:
Code:
format pe gui 4.0
movaps xmm0, dqword [veca]
addps  xmm0, dqword [vecb]
movaps dqword [veca], xmm0
int3

align 16
veca dd 1.0, 2.0, 0.0, 0.0
vecb dd 3.0, 4.0, 0.0, 0.0    


Note that every memory access must be dqword (16-byte) aligned; the exception is movups, but it is suboptimal. Also note that we are wasting half of the processing power here by padding the vectors with zeroes. The idea would be to use an array of vectors to make SSE really worth using.

Code:
format pe gui 4.0
ARRAY_SIZE = 2 ; 2 dqwords
mov    edi, vec_array_a + ARRAY_SIZE*16
mov    esi, vec_array_b + ARRAY_SIZE*16
mov    eax, -16*ARRAY_SIZE

.loop:
  movaps xmm0, dqword [edi+eax]
  addps  xmm0, dqword [esi+eax]
  movaps dqword [edi+eax], xmm0

  add    eax, 16
  jnz    .loop

int3


align 16
; A new vec2 every two dwords (4 vec2 per array)
vec_array_a dd  1.0,  3.0,  5.0, 7.0, 9.0, 11.0, 13.0, 15.0
vec_array_b dd 15.0, 13.0, 11.0, 9.0, 7.0,  5.0,  3.0,  1.0    


Final state according to Olly:
Code:
00401030        16.00000       16.00000       16.00000       16.00000
00401040        16.00000       16.00000       16.00000       16.00000
00401050        15.00000       13.00000       11.00000       9.000000
00401060        7.000000       5.000000       3.000000       1.000000    

Code:
00401000 > BF 50104000      MOV EDI,SSE_vec2.00401050
00401005   BE 70104000      MOV ESI,SSE_vec2.00401070
0040100A   B8 E0FFFFFF      MOV EAX,-20
0040100F   0F280407         MOVAPS XMM0,DQWORD PTR DS:[EDI+EAX]
00401013   0F580406         ADDPS XMM0,DQWORD PTR DS:[ESI+EAX]
00401017   0F290407         MOVAPS DQWORD PTR DS:[EDI+EAX],XMM0
0040101B   83C0 10          ADD EAX,10
0040101E  ^75 EF            JNZ SHORT SSE_vec2.0040100F
00401020   CC               INT3    


Still, note that my loop is not optimized. Apart from the obvious unrolling to remove the loop entirely, on bigger arrays the unrolling factor has to be decided, as well as the instruction placement (something I cannot help you much with...).

Benchmark this against the same thing implemented with FPU to see if you gain some speed improvement.
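
For that comparison, an FPU version of the same loop could look like the sketch below. It assumes the same array layout and data as the SSE example above and simply adds one float per iteration; it is not tuned in any way, so treat it as a baseline rather than the best the FPU can do.

```asm
format pe gui 4.0
ARRAY_SIZE = 2 ; 2 dqwords = 8 floats per array
mov    edi, vec_array_a + ARRAY_SIZE*16
mov    esi, vec_array_b + ARRAY_SIZE*16
mov    eax, -16*ARRAY_SIZE

.loop:
  fld    dword [edi+eax]     ; st0 = a[i]
  fadd   dword [esi+eax]     ; st0 = a[i] + b[i]
  fstp   dword [edi+eax]     ; a[i] = st0, FPU stack is empty again

  add    eax, 4              ; next float
  jnz    .loop

int3                         ; replace with ret when running outside a debugger


align 16
; A new vec2 every two dwords (4 vec2 per array)
vec_array_a dd  1.0,  3.0,  5.0, 7.0, 9.0, 11.0, 13.0, 15.0
vec_array_b dd 15.0, 13.0, 11.0, 9.0, 7.0,  5.0,  3.0,  1.0
```

With only eight floats the loop overhead dominates, so a meaningful benchmark would need much larger arrays or many repetitions.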
Post 15 Nov 2008, 22:22
adnimo



Joined: 18 Jul 2008
Posts: 49
adnimo
But this is the problem: I can't run this, yet the processor does have SSE. Perhaps this functionality is only present from SSE2 onwards?

I have tried your way before, but I just couldn't make it run at all.
By the way, I'm trying this on an x86 AMD Athlon XP equivalent.

PS: The code I posted does work just fine, but it's suboptimal, at least here.

PS2: I can't really see a use for SIMD at the moment; I won't have arrays of vectors at all. Are you saying that when processing a single vector there is nothing to be gained from SSE?
Post 15 Nov 2008, 22:35
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
My code as posted doesn't work for you? (Note that a crash is expected due to the int3 instruction; replace it with ret if you don't want to run this under OllyDbg or another debugger.)
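
For reference, movaps and addps belong to the original SSE instruction set, not SSE2. One way to see what the CPU itself reports is CPUID leaf 1, where EDX bit 25 indicates SSE and EDX bit 26 indicates SSE2. A minimal sketch follows; the label detect_sse and its return convention are made up for illustration, and it assumes a CPU that supports the CPUID instruction.

```asm
; Sketch only: detect_sse and its return convention are hypothetical.
; Returns eax = 0 (no SSE), 1 (SSE only) or 2 (SSE and SSE2).
; Clobbers ecx and edx; ebx is preserved.
detect_sse:
        push    ebx             ; cpuid overwrites eax, ebx, ecx and edx
        mov     eax, 1          ; leaf 1: feature information
        cpuid
        xor     eax, eax        ; assume no SSE
        bt      edx, 25         ; EDX bit 25 = SSE
        jnc     .done
        inc     eax             ; SSE present
        bt      edx, 26         ; EDX bit 26 = SSE2
        jnc     .done
        inc     eax             ; SSE2 present as well
.done:
        pop     ebx
        ret
```

If this reports SSE and the example still crashes, the cause is more likely the int3 or a misaligned memory operand than a missing instruction set.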

Quote:
PS2: I can't really see a use for SIMD at the moment; I won't have arrays of vectors at all. Are you saying that when processing a single vector there is nothing to be gained from SSE?


What are you going to have then?
Post 15 Nov 2008, 22:50
madmatt



Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
madmatt
There are many vector math functions in the D3DX libraries; you don't have to use Direct3D to use them.
Post 17 Nov 2008, 21:48
bitRAKE



Joined: 21 Jul 2003
Posts: 2888
Location: [RSP+8*5]
bitRAKE
SSE4.1 has DPPD/DPPS. Passing around pointers isn't a good idea for such small functions - how about some macros for pseudo vector instructions?
Code:
macro vec4add reg,regmem { addps reg,regmem }
macro vec4sub reg,regmem { subps reg,regmem }
macro vec4dot reg,regmem { dpps reg,regmem,11110001b } ; SSE4.1
macro vec3dot reg,regmem { dpps reg,regmem,01110001b } ; SSE4.1
macro vec2dot reg,regmem { dpps reg,regmem,00110001b } ; SSE4.1
macro vec2dot_dbl reg,regmem { dppd reg,regmem,00110001b } ; SSE4.1    
Post 18 Nov 2008, 05:17
madmatt



Joined: 07 Oct 2003
Posts: 1045
Location: Michigan, USA
madmatt
bitRAKE wrote:
SSE4.1 has DPPD/DPPS. Passing around pointers isn't a good idea for such small functions - how about some macros for pseudo vector instructions?
Code:
macro vec4add reg,regmem { addps reg,regmem }
macro vec4sub reg,regmem { subps reg,regmem }
macro vec4dot reg,regmem { dpps reg,regmem,11110001b } ; SSE4.1
macro vec3dot reg,regmem { dpps reg,regmem,01110001b } ; SSE4.1
macro vec2dot reg,regmem { dpps reg,regmem,00110001b } ; SSE4.1
macro vec2dot_dbl reg,regmem { dppd reg,regmem,00110001b } ; SSE4.1    


I don't think too many people have the SSE4.1 instruction set yet.
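
For CPUs without SSE4.1, a vec2 dot product can be built from SSE1 instructions alone. The macro below is only a sketch in the same style as the ones quoted above; the name vec2dot_sse1 and the extra scratch-register parameter tmp are assumptions for illustration, not part of anyone's library.

```asm
; Sketch only: vec2dot_sse1 is a hypothetical name; tmp is a scratch xmm register.
; Uses SSE1 instructions only. The result ends up in the low element of reg.
; Unlike dpps with mask 00110001b, the upper elements of reg are NOT zeroed.
; A memory operand passed as regmem must be 16-byte aligned.
macro vec2dot_sse1 reg,tmp,regmem
{
   mulps  reg,regmem          ; reg = x1*x2, y1*y2, (two unused products)
   movaps tmp,reg             ; keep a copy of the products
   shufps tmp,tmp,01010101b   ; every element of tmp = y1*y2
   addss  reg,tmp             ; low element of reg = x1*x2 + y1*y2
}
```

A call such as vec2dot_sse1 xmm0,xmm1,xmm2 would leave the dot product of the vec2 values in xmm0 and xmm2 in the low float of xmm0, at the cost of one extra register compared with dpps.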

_________________
Gimme a sledge hammer! I'LL FIX IT!
Post 18 Nov 2008, 06:44

Copyright © 1999-2020, Tomasz Grysztar.
