flat assembler
Message board for the users of flat assembler.
Is there an existent vector library for fasm?
vid 14 Nov 2008, 07:51
Remember that you can use C libraries from assembly, and optimized vector libraries of this kind would be written in assembly anyway. It's just unlikely that an existing library ships headers for FASM (or any other assembly syntax), so you would need to write those yourself, but that should be an easy task.
adnimo 14 Nov 2008, 07:53
vid, thanks for the pointer. Which optimized vector library do you recommend I take a look at? (Any on the open source side without many license limitations?)
vid 14 Nov 2008, 09:11
I really can't help you with this, never did much vector stuff. If you are REALLY after optimization, it is probably better to write time-critical operations yourself (use MMX or SSE).
Otherwise I really don't know, but you could try to take a look at libtommath. It has a really nice "lowlevel" interface to be used from asm, and I think it has vector functions too (not sure).
bitshifter 14 Nov 2008, 13:50
There are a few libraries around, but I have not seen one for fasm.
I am in the process of writing a 3d vector library for fasm. If you need some help writing a 2d version, I can assist. Here are a few tricks I'm using in my library to make it easy to choose either 32 or 64 bit precision vectors. I also note how much pressure is put on the fpu stack and whether it is restored to its original state. (In some cases it is desirable to return a value on the fpu stack.)
Code:
    ;------------------------------------------------
    ; 2d vector math library
    ;------------------------------------------------

    ; for 32 bit precision
    VEC2_VALTYPE fix dd
    VEC2_VALSIZE fix dword
    VEC2_OFFSETY fix 4

    ; for 64 bit precision
    ;VEC2_VALTYPE fix dq
    ;VEC2_VALSIZE fix qword
    ;VEC2_OFFSETY fix 8

    struct Vec2
      x VEC2_VALTYPE ?
      y VEC2_VALTYPE ?
    ends

    ;------------------------------------------------
    ; Vec2 = Vec2 + Vec2
    ;
    ; internal fpu stack pressure = 1
    ; external fpu stack pressure = 0
    ; restores fpu stack state = true
    ;------------------------------------------------
    macro Vec2_SumVec2 res,opa,opb
    {
      fld  VEC2_VALSIZE[opa]
      fadd VEC2_VALSIZE[opb]
      fstp VEC2_VALSIZE[res]
      fld  VEC2_VALSIZE[opa+VEC2_OFFSETY]
      fadd VEC2_VALSIZE[opb+VEC2_OFFSETY]
      fstp VEC2_VALSIZE[res+VEC2_OFFSETY]
    }

    ;------------------------------------------------
    ; Vec2 += Vec2
    ;
    ; internal fpu stack pressure = 1
    ; external fpu stack pressure = 0
    ; restores fpu stack state = true
    ;------------------------------------------------
    macro Vec2_AddVec2 dst,src
    {
      fld  VEC2_VALSIZE[src]
      fadd VEC2_VALSIZE[dst]
      fstp VEC2_VALSIZE[dst]
      fld  VEC2_VALSIZE[src+VEC2_OFFSETY]
      fadd VEC2_VALSIZE[dst+VEC2_OFFSETY]
      fstp VEC2_VALSIZE[dst+VEC2_OFFSETY]
    }
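For anyone who wants to check the macro results from C, here is a minimal sketch of the same two operations. It is only a reference model under stated assumptions, not part of the library above: the names `vec2_val`, `vec2_sum`, `vec2_add` and the `VEC2_USE_DOUBLE` switch are made up for illustration, with the `typedef` playing the role of the `VEC2_VALTYPE`/`VEC2_VALSIZE` lines.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the element type in one place, like the "fix" lines in the fasm code.
   VEC2_USE_DOUBLE is a hypothetical switch, not something from the thread. */
#ifdef VEC2_USE_DOUBLE
typedef double vec2_val;   /* 64 bit precision */
#else
typedef float vec2_val;    /* 32 bit precision */
#endif

typedef struct {
    vec2_val x;
    vec2_val y;
} Vec2;

/* res = opa + opb, the same result Vec2_SumVec2 stores. */
void vec2_sum(Vec2 *res, const Vec2 *opa, const Vec2 *opb)
{
    assert(res != NULL && opa != NULL && opb != NULL);
    res->x = opa->x + opb->x;
    res->y = opa->y + opb->y;
}

/* dst += src, the same result Vec2_AddVec2 stores. */
void vec2_add(Vec2 *dst, const Vec2 *src)
{
    assert(dst != NULL && src != NULL);
    dst->x += src->x;
    dst->y += src->y;
}
```

Changing the precision means touching only the `typedef`, which is the same design idea as swapping the three `fix` lines in the fasm version.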
adnimo 15 Nov 2008, 18:42
Hmm, this is strange. I began working on the vector library and thought about making the structure a vec4 so I could later use SIMD to speed things up, but it seems to be the opposite in most cases where only a few operations are needed on the structure fields. Is this right?
For instance, I do movaps to load vectors 'a' and 'b' into xmm0 and xmm1, then addps into xmm2, and movaps the result back to vector a (that's where I wanted the result). Doing the same with the fpu (in blocks of fld, fld, faddp, fstp for each field) gives better results. I ran a small benchmark and got the following results: 990ms vs 720ms ... the fpu wins in this case. Is this valid? When or how should I use SIMD in this case?
LocoDelAssembly 15 Nov 2008, 20:12
Could you post the code for this? I don't see why xmm2 is needed here.
adnimo 15 Nov 2008, 21:39
I tried just one movups and then addps with the address of the second vector, but it crashes my application. That's why I had to use 2 movups and then add the registers.
For what it's worth, this laptop only has SSE extensions; I don't know if it makes any difference. I also tried moving the pointer in EAX to XMM0 and I couldn't. This has to be possible, but for some reason it's crashing.
Code:
    movups xmm0, dqword[v_veca]
    movups xmm1, dqword[v_vecb]
    addps  xmm0, xmm1
    movups dqword[v_veca], xmm0
I mentioned 2 addps or the use of xmm2 before, my bad. (I've been goofing around with the code way too much.)
LocoDelAssembly 15 Nov 2008, 22:22
You could do this:
Code:
    format pe gui 4.0

    movaps xmm0, dqword [veca]
    addps  xmm0, dqword [vecb]
    movaps dqword [veca], xmm0
    int3

    align 16
    veca dd 1.0, 2.0, 0.0, 0.0
    vecb dd 3.0, 4.0, 0.0, 0.0
Note that every memory access must be dqword (16-byte) aligned; the exception is movups, but it is suboptimal. Also note that we are wasting half of the processing power here by padding the vectors with zeroes. The idea would be to use an array of vectors to make SSE worth using.
Code:
    format pe gui 4.0

    ARRAY_SIZE = 2 ; 2 dqwords

    mov edi, vec_array_a + ARRAY_SIZE*16
    mov esi, vec_array_b + ARRAY_SIZE*16
    mov eax, -16*ARRAY_SIZE

    .loop:
    movaps xmm0, dqword [edi+eax]
    addps  xmm0, dqword [esi+eax]
    movaps dqword [edi+eax], xmm0
    add eax, 16
    jnz .loop

    int3

    align 16
    ; A new vec2 every two dwords (4 vec2 per array)
    vec_array_a dd 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0
    vec_array_b dd 15.0, 13.0, 11.0, 9.0, 7.0, 5.0, 3.0, 1.0
Final state according to Olly:
Code:
    00401030 16.00000 16.00000 16.00000 16.00000
    00401040 16.00000 16.00000 16.00000 16.00000
    00401050 15.00000 13.00000 11.00000 9.000000
    00401060 7.000000 5.000000 3.000000 1.000000
Code:
    00401000 > BF 50104000   MOV EDI,SSE_vec2.00401050
    00401005   BE 70104000   MOV ESI,SSE_vec2.00401070
    0040100A   B8 E0FFFFFF   MOV EAX,-20
    0040100F   0F280407      MOVAPS XMM0,DQWORD PTR DS:[EDI+EAX]
    00401013   0F580406      ADDPS XMM0,DQWORD PTR DS:[ESI+EAX]
    00401017   0F290407      MOVAPS DQWORD PTR DS:[EDI+EAX],XMM0
    0040101B   83C0 10       ADD EAX,10
    0040101E  ^75 EF         JNZ SHORT SSE_vec2.0040100F
    00401020   CC            INT3
Still, note that my loop is not optimized. Apart from the obvious unrolling to remove the loop, on bigger arrays the unrolling factor has to be decided, as well as the instruction placement (something that I cannot help you much with...). Benchmark this against the same thing implemented with the FPU to see if you gain any speed.
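To check the array loop's result outside a debugger, the same indexing scheme can be modelled in plain C. This is only a scalar sketch under stated assumptions, not SSE code: the function name `add_packed_arrays` is made up for illustration, the inner `lane` loop stands in for one movaps/addps/movaps group, and the model ignores the 16-byte alignment requirement of the real instructions. Like the fasm loop, it points at the end of both arrays and counts a negative index up to zero, so a single counter serves as both the offset and the loop condition.

```c
#include <assert.h>
#include <stddef.h>

/* a[i] += b[i] for n floats, four "lanes" per iteration.
   n must be a nonzero multiple of 4, just as the fasm loop
   expects at least one whole dqword. */
void add_packed_arrays(float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    assert(n != 0 && n % 4 == 0);

    float *a_end = a + n;              /* like edi = vec_array_a + size */
    const float *b_end = b + n;        /* like esi = vec_array_b + size */
    ptrdiff_t i = -(ptrdiff_t)n;       /* like eax = -size */

    do {
        /* One packed add: four floats at once in the SSE version. */
        for (int lane = 0; lane < 4; lane++)
            a_end[i + lane] += b_end[i + lane];
        i += 4;                        /* like add eax, 16 (bytes) */
    } while (i != 0);                  /* like jnz .loop */
}
```

Run on the two arrays from the post, every element of the first array should end up as 16.0, matching the Olly dump, while the second array is left unchanged.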
adnimo 15 Nov 2008, 22:35
But this is the problem: I can't run this, yet the processor does have SSE. Perhaps this functionality is only present from SSE2 onwards?
I have tried your way before, but I just couldn't make it run at all. By the way, I'm trying this on an x86 AMD Athlon XP equivalent.
PS: The code I posted does work just fine, but it's suboptimal, at least here.
PS2: I can't really see a use for SIMD at this moment, since I won't have arrays of vectors at all. Are you saying that when processing one vector at a time there is nothing to be gained from SSE?
LocoDelAssembly 15 Nov 2008, 22:50
My code as it was posted doesn't work for you? (Note that a crash is expected because of the int3 instruction; replace it with ret if you are not running this under OllyDbg or another debugger.)
Quote:
ps2: I can't really see a use of SIMD at this moment, I won't be having arrays of vectors at all -- you say that by processing by one vector there is nothing to be gained from SSE?
What are you going to have then?
madmatt 17 Nov 2008, 21:48
There are many vector math functions in the D3DX libraries, and you don't have to use Direct3D to use them.
bitRAKE 18 Nov 2008, 05:17
SSE4.1 has DPPD/DPPS. Passing around pointers isn't a good idea for such small functions - how about some macros for pseudo vector instructions?
Code:
    macro vec4add reg,regmem { addps reg,regmem }
    macro vec4sub reg,regmem { subps reg,regmem }
    macro vec4dot reg,regmem { dpps reg,regmem,11110001b } ; SSE4.1
    macro vec3dot reg,regmem { dpps reg,regmem,01110001b } ; SSE4.1
    macro vec2dot reg,regmem { dpps reg,regmem,00110001b } ; SSE4.1
    macro vec2dot_dbl reg,regmem { dppd reg,regmem,00110001b } ; SSE4.1
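The immediate byte in the dpps macros is easier to read with a scalar model next to it. The sketch below is a plain C approximation, not the instruction itself: `dpps_model` is a made-up name, and the real hardware may add the products in a different order, which can change rounding for values that are not exactly representable. The high four bits of the immediate select which element pairs are multiplied and summed; the low four bits select which destination elements receive the sum, with the others set to zero.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of "dpps dst, src, imm8" on four floats.
   Bits 4..7 of imm8: which products dst[i]*src[i] enter the sum.
   Bits 0..3 of imm8: which elements of dst get the sum (others become 0). */
void dpps_model(float dst[4], const float src[4], unsigned imm8)
{
    assert(dst != NULL && src != NULL);
    assert(imm8 <= 0xFFu);

    float sum = 0.0f;
    for (int i = 0; i < 4; i++) {
        if (imm8 & (1u << (4 + i)))
            sum += dst[i] * src[i];
    }
    for (int i = 0; i < 4; i++)
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With this reading, 11110001b (vec4dot) multiplies all four pairs and writes the result to element 0 only, 01110001b (vec3dot) ignores the fourth pair, and 00110001b (vec2dot) uses just the first two.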
madmatt 18 Nov 2008, 06:44
bitRAKE wrote:
SSE4.1 has DPPD/DPPS. Passing around pointers isn't a good idea for such small functions - how about some macros for pseudo vector instructions?
I don't think too many people have the SSE4.1 instruction set yet.
_________________
Gimme a sledge hammer! I'LL FIX IT!
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.