flat assembler
Message board for the users of flat assembler.

pool 18 Dec 2009, 12:06
..
Last edited by pool on 17 Mar 2013, 12:06; edited 1 time in total 


madmatt 20 Dec 2009, 09:53
Maybe this should be moved to the 'MAIN' forum.



bitshifter 20 Dec 2009, 17:44
I have a small 4x4 matrix library that Madis and I have been developing...
If 3D transformations with SSE interest you then I can post some nice code.


Borsuc 20 Dec 2009, 20:23
Do you have a 3x3 matrix version instead? I would be interested in that. I prefer less overhead, and handling translations separately so they can be optimized better.



bitshifter 20 Dec 2009, 22:21
Using the whole 128-bit register saves me from shuffling all the time.
More overhead, yes; faster, yes. OK, here is a small example, first in C, then in SSE2.

Code:
void TranslateMatrix_XYZ(float m[16], const float v[3])
{
    m[12] += m[0] * v[0] + m[4] * v[1] + m[8] * v[2];
    m[13] += m[1] * v[0] + m[5] * v[1] + m[9] * v[2];
    m[14] += m[2] * v[0] + m[6] * v[1] + m[10] * v[2];
    m[15] += m[3] * v[0] + m[7] * v[1] + m[11] * v[2];
}

Code:
; g_translation is vector [X,Y,Z,W=1.0]
; data aligned on 16 byte boundary
movaps xmm0,dqword[g_translation]
movaps xmm1,dqword[g_matrix]
movaps xmm2,dqword[g_matrix+16]
movaps xmm3,dqword[g_matrix+32]
movaps xmm4,dqword[g_matrix+48]
pshufd xmm5,xmm0,00000000b ; broadcast X
pshufd xmm6,xmm0,01010101b ; broadcast Y
pshufd xmm7,xmm0,10101010b ; broadcast Z
mulps xmm1,xmm5
mulps xmm2,xmm6
mulps xmm3,xmm7
addps xmm4,xmm1
addps xmm4,xmm2
addps xmm4,xmm3
movaps dqword[g_matrix+48],xmm4

I also have these routines already implemented and tested...
CopyMatrix
IdentityMatrix
TransposeMatrix
RotateMatrix_X
RotateMatrix_Y
RotateMatrix_Z
RotateMatrix_XYZ
ScaleMatrix_X
ScaleMatrix_Y
ScaleMatrix_Z
ScaleMatrix_XYZ
TranslateMatrix_X
TranslateMatrix_Y
TranslateMatrix_Z
TranslateMatrix_XYZ
MultiplyMatrices
TransformVector
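For readers wondering why only the last four floats of the matrix change: assuming the matrix is stored column-major (element (row, col) at index col*4 + row, which is what the indexing in TranslateMatrix_XYZ suggests), the routine gives the same result as multiplying the matrix on the right by a translation matrix. The sketch below checks that claim in plain C. The names multiply_matrices, make_translation, translate_matrix_xyz and translate_matches_product are made up for this illustration and are not taken from the library discussed in the thread.

```c
#include <string.h>

/* Column-major 4x4 product: out = a * b.
   Element (row, col) is stored at index col*4 + row (assumed layout). */
void multiply_matrices(float out[16], const float a[16], const float b[16])
{
    float tmp[16];
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            tmp[col * 4 + row] = sum;
        }
    }
    memcpy(out, tmp, sizeof tmp);
}

/* Hypothetical helper: identity matrix with v placed in the fourth column. */
void make_translation(float t[16], const float v[3])
{
    for (int i = 0; i < 16; ++i)
        t[i] = (i % 5 == 0) ? 1.0f : 0.0f;   /* indices 0, 5, 10, 15 are the diagonal */
    t[12] = v[0];
    t[13] = v[1];
    t[14] = v[2];
}

/* Same arithmetic as TranslateMatrix_XYZ from the post, written as a lane loop:
   each iteration corresponds to one of the four SSE lanes. */
void translate_matrix_xyz(float m[16], const float v[3])
{
    for (int lane = 0; lane < 4; ++lane)
        m[12 + lane] += m[lane] * v[0] + m[4 + lane] * v[1] + m[8 + lane] * v[2];
}

/* Returns 1 if translating m in place agrees with m * T within a small tolerance. */
int translate_matches_product(const float m[16], const float v[3])
{
    float direct[16], t[16], product[16];
    memcpy(direct, m, sizeof direct);
    translate_matrix_xyz(direct, v);
    make_translation(t, v);
    multiply_matrices(product, m, t);
    for (int i = 0; i < 16; ++i) {
        float d = direct[i] - product[i];
        if (d < 0.0f)
            d = -d;
        if (d > 1e-5f)
            return 0;
    }
    return 1;
}
```

The first three columns of the product come out unchanged because the translation matrix has identity columns there, which is why the SSE version only needs to store the last 16 bytes.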


Borsuc 20 Dec 2009, 23:27
Hmm, I thought you were doing 4 transformations at a time (so any matrix size would be suitable), not the whole matrix at a time. Actually, now that I think of it, a 3x4 matrix would be even better with that approach (i.e. you transform 4 vectors at a time).
Thanks for the code, I'm a noob in SSE and it proves helpful.
_________________
Previously known as The_Grey_Beast


bitshifter 20 Dec 2009, 23:45
For transforming vectors I set up the transformation matrix once,
then batch process all my vertices through it. It's nice and fast that way.
Now don't confuse matrix transformations with vector transformations...
To transform a vector would be like...

Code:
void TransformVector(float v[3], const float m[16])
{
    const float x = v[0];
    const float y = v[1];
    const float z = v[2];
    v[0] = m[0] * x + m[4] * y + m[8] * z + m[12];
    v[1] = m[1] * x + m[5] * y + m[9] * z + m[13];
    v[2] = m[2] * x + m[6] * y + m[10] * z + m[14];
}

Code:
; g_vector is dqword [X,Y,Z,W=1.0]
; All data aligned on 16 byte boundary
movaps xmm0,dqword[g_vector]
movaps xmm1,dqword[g_matrix]
movaps xmm2,dqword[g_matrix+16]
movaps xmm3,dqword[g_matrix+32]
movaps xmm4,dqword[g_matrix+48]
pshufd xmm5,xmm0,00000000b ; broadcast X
pshufd xmm6,xmm0,01010101b ; broadcast Y
pshufd xmm7,xmm0,10101010b ; broadcast Z
mulps xmm1,xmm5
mulps xmm2,xmm6
mulps xmm3,xmm7
addps xmm4,xmm1
addps xmm4,xmm2
addps xmm4,xmm3
movaps dqword[g_vector],xmm4

We only need 3 dot products, but we get 4 and trash vector.w in the process,
so you would need to preserve it somehow...

Also I would like to thank Madis for tutoring me with SIMD optimizations.
Sometimes I do well and produce very fast code, other times he kicks my butt.
We usually ponder a routine for a while and see who can do the best.
And of course, someone else may spot ways to do things better, as always...
And if you want to see some mind blowing 3x concatenated rotations, just ask...
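One way to picture the batch processing and the trashed w described above is a plain C model of the four lanes. This is only a sketch under assumptions: vertices are stored as packed [X,Y,Z,W] groups of four floats, the matrix is column-major, and transform_vertices is a hypothetical name, not a routine from the library. It computes all four lanes the way the SSE code does, then puts the original w back.

```c
#include <stddef.h>

/* Transforms count vertices in place. Each vertex is four floats [X,Y,Z,W]
   (assumed layout, matching the dqword vector in the post). */
void transform_vertices(float *verts, size_t count, const float m[16])
{
    for (size_t i = 0; i < count; ++i) {
        float *v = verts + 4 * i;
        const float x = v[0];
        const float y = v[1];
        const float z = v[2];
        const float w = v[3];          /* saved so it can be restored below */
        float out[4];

        /* One iteration per SSE lane: lane 3 is the extra result
           that the packed version computes whether we want it or not. */
        for (int lane = 0; lane < 4; ++lane)
            out[lane] = m[lane] * x + m[4 + lane] * y + m[8 + lane] * z + m[12 + lane];

        out[3] = w;                    /* restore the original w */

        v[0] = out[0];
        v[1] = out[1];
        v[2] = out[2];
        v[3] = out[3];
    }
}
```

Note that when the matrix is affine (m[3], m[7] and m[11] are 0 and m[15] is 1), lane 3 evaluates to exactly 1.0, so a vertex stored with W=1.0 keeps its w without any extra work.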


Borsuc 20 Dec 2009, 23:51
Yes, well, transforming vectors is the slow part: you have to do it for each vertex, while the matrix is usually built once per object.
Sorry for the noob question, but can't you use SSE in a way that doesn't waste anything at all? Put x1,x2,x3,x4 in one register (and likewise the y's and z's) and then use the same algorithm as a normal (non-SSE) vector transformation, except that you do 4 vectors at a time. Isn't that possible, or does it require lots of shuffling? (I'm of course also talking about storing the vertices like that in memory, not as "x1,y1,z1;x2,y2,z2", which would probably need shuffling.)

bitshifter wrote:
And if you want to see some mind blowing 3x concatenated rotations, just ask...

_________________
Previously known as The_Grey_Beast
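The layout being asked about is usually called structure-of-arrays. A minimal C sketch of it follows, assuming separate xs, ys and zs arrays and a column-major matrix; transform_soa is a hypothetical name used only here. The inner loop over four lanes stands in for one packed register, so in an SSE version each matrix element would be broadcast once outside the loop and each term would become a single mulps/addps, with no shuffling per vertex and no unused w lane.

```c
#include <stddef.h>

/* Transforms count points stored as three separate arrays (assumed layout).
   Points are processed in groups of four to mirror one 128-bit register. */
void transform_soa(float *xs, float *ys, float *zs, size_t count, const float m[16])
{
    for (size_t base = 0; base < count; base += 4) {
        /* The last group may hold fewer than four points. */
        const size_t lanes = (count - base < 4) ? (count - base) : 4;

        for (size_t lane = 0; lane < lanes; ++lane) {
            const size_t i = base + lane;
            const float x = xs[i];
            const float y = ys[i];
            const float z = zs[i];

            /* Same formulas as the scalar TransformVector, one point per lane. */
            xs[i] = m[0] * x + m[4] * y + m[8]  * z + m[12];
            ys[i] = m[1] * x + m[5] * y + m[9]  * z + m[13];
            zs[i] = m[2] * x + m[6] * y + m[10] * z + m[14];
        }
    }
}
```

The trade-off is that the data has to live in this layout to begin with; converting from interleaved "x1,y1,z1;x2,y2,z2" storage on the fly would bring the shuffling back.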


bitshifter 21 Dec 2009, 00:01
Yes, by caching the matrix and pumping all the vertices through it.
And yes, concatenation in 3 dimensions all in one shot.
Note: Changing the order of concatenated rotations produces different results, as expected.
_________________
Coding a 3D game engine with fasm is like trying to eat an elephant, you just have to keep focused and take it one 'byte' at a time.
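The note about rotation order can be checked with a small C sketch. It assumes column-major matrices and uses 90 degree rotations so every entry is exactly 0, 1 or -1. ROT_X_90, ROT_Z_90, apply_matrix and apply_in_order are names invented for this example; applying two matrices one after the other stands in for concatenating them.

```c
/* 90 degree rotation about X: (x, y, z) -> (x, -z, y). Column-major (assumed). */
const float ROT_X_90[16] = {
    1, 0, 0, 0,
    0, 0, 1, 0,
    0, -1, 0, 0,
    0, 0, 0, 1
};

/* 90 degree rotation about Z: (x, y, z) -> (-y, x, z). Column-major (assumed). */
const float ROT_Z_90[16] = {
    0, 1, 0, 0,
    -1, 0, 0, 0,
    0, 0, 1, 0,
    0, 0, 0, 1
};

/* Same arithmetic as TransformVector from the thread. */
void apply_matrix(float v[3], const float m[16])
{
    const float x = v[0];
    const float y = v[1];
    const float z = v[2];
    v[0] = m[0] * x + m[4] * y + m[8]  * z + m[12];
    v[1] = m[1] * x + m[5] * y + m[9]  * z + m[13];
    v[2] = m[2] * x + m[6] * y + m[10] * z + m[14];
}

/* Applies first, then second, to v in place. */
void apply_in_order(float v[3], const float first[16], const float second[16])
{
    apply_matrix(v, first);
    apply_matrix(v, second);
}
```

Starting from the point (1, 0, 0), rotating about X and then Z ends at (0, 1, 0), while rotating about Z and then X ends at (0, 0, 1).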


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.