flat assembler
Message board for the users of flat assembler.
Index
> Main > How to fill xmmx with 4 singles: x^8, x^6, x^4, x^2 
Author 

mattst88 16 Dec 2006, 03:06
This is for calculating cosine using the Taylor series. I can easily load 1/4! etc from a LUT shamelessly ripped from r22.
Any suggestions how to load x^8, x^6, x^4, and x^2 into an xmm register in any order? _________________ My x86 Instruction Reference  includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions. Assembly Programmer's Journal 

16 Dec 2006, 03:06 

Tomasz Grysztar 16 Dec 2006, 18:12
OK, my quick try:
Code: movss xmm0,[x] ; x # # # mulss xmm0,xmm0 ; x^2 # # # movss xmm1,xmm0 shufps xmm0,xmm0,00111000b ; x^2 # # x^2 mulss xmm0,xmm1 ; x^4 # # x^2 shufps xmm0,xmm0,00111000b ; x^4 # x^2 x^4 mulss xmm0,xmm1 ; x^6 # x^2 x^4 shufps xmm0,xmm0,00111000b ; x^6 x^2 x^4 x^6 mulss xmm0,xmm1 ; x^8 x^2 x^4 x^6 shufps xmm0,xmm0,00111001b ; x^2 x^4 x^6 x^8 Note that three pairs of instructions are identical. 

16 Dec 2006, 18:12 

mattst88 16 Dec 2006, 18:50
That's what I was thinking, but I was hoping there was a better way than multiplying and shifting. Thanks for responding


16 Dec 2006, 18:50 

Tomasz Grysztar 16 Dec 2006, 19:23
Let's hope someone rises to the challenge and shows with a faster version.


16 Dec 2006, 19:23 

mattst88 16 Dec 2006, 19:40
Let's put this problem in context. Here's the cosine function I've come up with:
Code: cosine: movss xmm0,[x] ; x # # # mulss xmm0,xmm0 ; x^2 # # # movss xmm1,xmm0 shufps xmm0,xmm0,00111000b ; x^2 # # x^2 mulss xmm0,xmm1 ; x^4 # # x^2 shufps xmm0,xmm0,00111000b ; x^4 # x^2 x^4 mulss xmm0,xmm1 ; x^6 # x^2 x^4 shufps xmm0,xmm0,00111000b ; x^6 x^2 x^4 x^6 mulss xmm0,xmm1 ; x^8 x^2 x^4 x^6 ; shufps xmm0,xmm0,00111001b ; x^2 x^4 x^6 x^8 movaps xmm1,dqword[CosTable] mulps xmm0,xmm1 haddps xmm0,xmm0 haddps xmm0,xmm0 movss xmm2,[one] addss xmm2,xmm0 ret align 16 CosTable dd 2.4801587301587301587301587301587e5,0.5 dd 0.041666666666666666666666666666667,0.0013888888888888888888888888888889 x dd 3.1415926535 one dd 1.0 What optimizations can I make to the total algorithm? 

16 Dec 2006, 19:40 

kohlrak 16 Dec 2006, 19:58
I don't think it's possible to make direct binary manipulation (aka doing calculations with the binary operaters rather than numerical operators) any faster. I myself would like to see some one try and tumble with that one. lol


16 Dec 2006, 19:58 

Goplat 16 Dec 2006, 20:52
This way seems to be a bit faster.
Code: movss xmm0,[x] ;x 0 0 0 mulss xmm0,xmm0 ;x^2 0 0 0 movss xmm1,[one] ;1 0 0 0 movlhps xmm1,xmm0 ;1 0 x^2 0 mulss xmm0,xmm0 ;x^4 0 0 0 movsldup xmm1,xmm1 ;1 1 x^2 x^2 movss xmm1,xmm0 ;x^4 1 x^2 x^2 shufps xmm0,xmm1,00010000b ;x^4 x^4 1 x^4 mulps xmm0,xmm1 ;x^8 x^4 x^2 x^6 Code: fld [x] fmul st,st0 fld [c8] fmul st,st1 fsub [c6] fmul st,st1 fadd [c4] fmul st,st1 fsub [c2] fmulp st1,st fadd [one] ... align 16 c8 dd 2.4801587301587301587301587301587e5 c6 dd 0.0013888888888888888888888888888889 c4 dd 0.041666666666666666666666666666667 c2 dd 0.5 

16 Dec 2006, 20:52 

r22 16 Dec 2006, 22:27
Because of the dependencies in the algorithm both ways SSE or FPU will have a lot of stalls during the execution. Interleaving the instructions can help a little.
So for taking an array of theta's and calculating the sin and/or cos for them and store them in another array, you can really take advantage of the parallelism of the SIMD instructions and make use of all the XMMX registers. Goplat your FPU code doesn't look like it's equivalent to the SSE version, so I don't quite understand your comparison. 

16 Dec 2006, 22:27 

Goplat 16 Dec 2006, 23:11
r22: It is equivalent, just a different approach. I went back to the original formula of cos(x) ~ 1  x^2/2! + x^4/4!  x^6/6! + x^8/8!
Code: fld [c8] ;1/8! fmul st,st1 ;x^2/8! fsub [c6] ;1/6! + x^2/8! fmul st,st1 ;x^2/6! + x^4/8! fadd [c4] ;1/4!  x^2/6! + x^4/8! fmul st,st1 ;x^2/4!  x^4/6! + x^6/8! fsub [c2] ;1/2! + x^2/4!  x^4/6! + x^6/8! fmulp st1,st ;x^2/2! + x^4/4!  x^6/6! + x^8/8! fadd [one] ;1  x^2/2! + x^4/4!  x^6/6! + x^8/8! 

16 Dec 2006, 23:11 

r22 17 Dec 2006, 19:26
I'm sorry for not elaborating in my last post.
Because you changed the algorithm around, it's not comparible (Like comparing a bubble sort to an insertion sort, but yield the correct results but in totally different ways), unless we rework the SSE code to use the same distributive multiplication method that you used in the FPU code. If this change you've proposed works well in an SSE translation you may have found a much more efficient way to do parallel calculation of the trig functions (or any function that can be estimated with a summed series). Allowing the distributive law to take care of the increasing exponents, is something I didn't even think of when I first set out to code an SSE taylor series estimate of cosine. Very good lesson in how code level optimization should occur AFTER algorithm analysis. 

17 Dec 2006, 19:26 

kohlrak 17 Dec 2006, 19:34
r22 wrote: I'm sorry for not elaborating in my last post. Look at it like a redneck. The idea is to get the same answer, faster or with less code. If you get the same answer with a faster code, use it. Now if the idea is to write a code to use those registers and/or instructions as examples, then you would worry about not using this code, but as long as you get the right result in a faster manor and/or with less code, that's what you do, right? 

17 Dec 2006, 19:34 

mattst88 17 Dec 2006, 20:04
I could be crazy, but Goplat's code doesn't look like it works. Maybe I just don't understand it.
Edit: It works great. It was just over my head. That's actually an excellent piece of code. Awesome work Goplat Last edited by mattst88 on 17 Dec 2006, 23:40; edited 1 time in total 

17 Dec 2006, 20:04 

kohlrak 17 Dec 2006, 20:06
Well, i really don't know what it's supposed to do, but if i did i'd test it and we could easily find out. lol


17 Dec 2006, 20:06 

MCD 28 Dec 2006, 17:30
okay, my try of the original problem, I got 3 different versions:
Code: ;1.: movss xmm0,[x] orps xmm0,[_1_1_1_0] mulss xmm0,xmm0 movaps xmm1,xmm0 shufps xmm1,xmm1,11100000b movaps xmm2,xmm1 shufps xmm2,xmm2,01000100b mulss xmm0,xmm1 mulps xmm0,xmm2 ;2: ;maybe punpckldq / qdq got some int/float switching overhead movss xmm0,[x] orps xmm0,[_1_1_1_0] mulss xmm0,xmm0 movaps xmm1,xmm0 punpckldq xmm1,xmm1 movaps xmm2,xmm1 punpcklqdq xmm2,xmm2 mulss xmm0,xmm1 mulps xmm0,xmm2 ;3: ;this requires SSE3, but shortest/fastest one movss xmm0,[x] orps xmm0,[_1_1_1_0] mulss xmm0,xmm0 movsldup xmm1,xmm0 movddup xmm2,xmm1 mulss xmm0,xmm1 mulps xmm0,xmm2 unfortunately, I got no time testing those _________________ MCD  the inevitable return of the Mad Computer Doggy __/ .+~ .  

28 Dec 2006, 17:30 

< Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992023, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.