flat assembler
Message board for the users of flat assembler.
![]() |
Author |
|
mattst88 16 Dec 2006, 03:06
This is for calculating cosine using the Taylor series. I can easily load 1/4! etc from a LUT shamelessly ripped from r22.
Any suggestions how to load x^8, x^6, x^4, and x^2 into an xmm register in any order? _________________ My x86 Instruction Reference -- includes SSE, SSE2, SSE3, SSSE3, SSE4 instructions. Assembly Programmer's Journal |
|||
![]() |
|
Tomasz Grysztar 16 Dec 2006, 18:12
OK, my quick try:
Code: movss xmm0,[x] ; x # # # mulss xmm0,xmm0 ; x^2 # # # movss xmm1,xmm0 shufps xmm0,xmm0,00111000b ; x^2 # # x^2 mulss xmm0,xmm1 ; x^4 # # x^2 shufps xmm0,xmm0,00111000b ; x^4 # x^2 x^4 mulss xmm0,xmm1 ; x^6 # x^2 x^4 shufps xmm0,xmm0,00111000b ; x^6 x^2 x^4 x^6 mulss xmm0,xmm1 ; x^8 x^2 x^4 x^6 shufps xmm0,xmm0,00111001b ; x^2 x^4 x^6 x^8 Note that three pairs of instructions are identical. |
|||
![]() |
|
mattst88 16 Dec 2006, 18:50
That's what I was thinking, but I was hoping there was a better way than multiplying and shifting. Thanks for responding
![]() |
|||
![]() |
|
Tomasz Grysztar 16 Dec 2006, 19:23
Let's hope someone rises to the challenge and shows with a faster version.
![]() |
|||
![]() |
|
mattst88 16 Dec 2006, 19:40
Let's put this problem in context. Here's the cosine function I've come up with:
Code: cosine: movss xmm0,[x] ; x # # # mulss xmm0,xmm0 ; x^2 # # # movss xmm1,xmm0 shufps xmm0,xmm0,00111000b ; x^2 # # x^2 mulss xmm0,xmm1 ; x^4 # # x^2 shufps xmm0,xmm0,00111000b ; x^4 # x^2 x^4 mulss xmm0,xmm1 ; x^6 # x^2 x^4 shufps xmm0,xmm0,00111000b ; x^6 x^2 x^4 x^6 mulss xmm0,xmm1 ; x^8 x^2 x^4 x^6 ; shufps xmm0,xmm0,00111001b ; x^2 x^4 x^6 x^8 movaps xmm1,dqword[CosTable] mulps xmm0,xmm1 haddps xmm0,xmm0 haddps xmm0,xmm0 movss xmm2,[one] addss xmm2,xmm0 ret align 16 CosTable dd 2.4801587301587301587301587301587e-5,-0.5 dd 0.041666666666666666666666666666667,-0.0013888888888888888888888888888889 x dd 3.1415926535 one dd 1.0 What optimizations can I make to the total algorithm? |
|||
![]() |
|
kohlrak 16 Dec 2006, 19:58
I don't think it's possible to make direct binary manipulation (aka doing calculations with the binary operaters rather than numerical operators) any faster. I myself would like to see some one try and tumble with that one. lol
|
|||
![]() |
|
Goplat 16 Dec 2006, 20:52
This way seems to be a bit faster.
Code: movss xmm0,[x] ;x 0 0 0 mulss xmm0,xmm0 ;x^2 0 0 0 movss xmm1,[one] ;1 0 0 0 movlhps xmm1,xmm0 ;1 0 x^2 0 mulss xmm0,xmm0 ;x^4 0 0 0 movsldup xmm1,xmm1 ;1 1 x^2 x^2 movss xmm1,xmm0 ;x^4 1 x^2 x^2 shufps xmm0,xmm1,00010000b ;x^4 x^4 1 x^4 mulps xmm0,xmm1 ;x^8 x^4 x^2 x^6 Code: fld [x] fmul st,st0 fld [c8] fmul st,st1 fsub [c6] fmul st,st1 fadd [c4] fmul st,st1 fsub [c2] fmulp st1,st fadd [one] ... align 16 c8 dd 2.4801587301587301587301587301587e-5 c6 dd 0.0013888888888888888888888888888889 c4 dd 0.041666666666666666666666666666667 c2 dd 0.5 |
|||
![]() |
|
r22 16 Dec 2006, 22:27
Because of the dependencies in the algorithm both ways SSE or FPU will have a lot of stalls during the execution. Interleaving the instructions can help a little.
So for taking an array of theta's and calculating the sin and/or cos for them and store them in another array, you can really take advantage of the parallelism of the SIMD instructions and make use of all the XMMX registers. Goplat your FPU code doesn't look like it's equivalent to the SSE version, so I don't quite understand your comparison. |
|||
![]() |
|
Goplat 16 Dec 2006, 23:11
r22: It is equivalent, just a different approach. I went back to the original formula of cos(x) ~ 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8!
Code: fld [c8] ;1/8! fmul st,st1 ;x^2/8! fsub [c6] ;-1/6! + x^2/8! fmul st,st1 ;-x^2/6! + x^4/8! fadd [c4] ;1/4! - x^2/6! + x^4/8! fmul st,st1 ;x^2/4! - x^4/6! + x^6/8! fsub [c2] ;-1/2! + x^2/4! - x^4/6! + x^6/8! fmulp st1,st ;-x^2/2! + x^4/4! - x^6/6! + x^8/8! fadd [one] ;1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! |
|||
![]() |
|
r22 17 Dec 2006, 19:26
I'm sorry for not elaborating in my last post.
Because you changed the algorithm around, it's not comparible (Like comparing a bubble sort to an insertion sort, but yield the correct results but in totally different ways), unless we rework the SSE code to use the same distributive multiplication method that you used in the FPU code. If this change you've proposed works well in an SSE translation you may have found a much more efficient way to do parallel calculation of the trig functions (or any function that can be estimated with a summed series). Allowing the distributive law to take care of the increasing exponents, is something I didn't even think of when I first set out to code an SSE taylor series estimate of cosine. Very good lesson in how code level optimization should occur AFTER algorithm analysis. |
|||
![]() |
|
kohlrak 17 Dec 2006, 19:34
r22 wrote: I'm sorry for not elaborating in my last post. Look at it like a redneck. The idea is to get the same answer, faster or with less code. If you get the same answer with a faster code, use it. Now if the idea is to write a code to use those registers and/or instructions as examples, then you would worry about not using this code, but as long as you get the right result in a faster manor and/or with less code, that's what you do, right? |
|||
![]() |
|
mattst88 17 Dec 2006, 20:04
I could be crazy, but Goplat's code doesn't look like it works. Maybe I just don't understand it.
Edit: It works great. It was just over my head. That's actually an excellent piece of code. Awesome work Goplat ![]() Last edited by mattst88 on 17 Dec 2006, 23:40; edited 1 time in total |
|||
![]() |
|
kohlrak 17 Dec 2006, 20:06
Well, i really don't know what it's supposed to do, but if i did i'd test it and we could easily find out. lol
|
|||
![]() |
|
MCD 28 Dec 2006, 17:30
okay, my try of the original problem, I got 3 different versions:
Code: ;1.: movss xmm0,[x] orps xmm0,[_1_1_1_0] mulss xmm0,xmm0 movaps xmm1,xmm0 shufps xmm1,xmm1,11100000b movaps xmm2,xmm1 shufps xmm2,xmm2,01000100b mulss xmm0,xmm1 mulps xmm0,xmm2 ;2: ;maybe punpckldq / qdq got some int/float switching overhead movss xmm0,[x] orps xmm0,[_1_1_1_0] mulss xmm0,xmm0 movaps xmm1,xmm0 punpckldq xmm1,xmm1 movaps xmm2,xmm1 punpcklqdq xmm2,xmm2 mulss xmm0,xmm1 mulps xmm0,xmm2 ;3: ;this requires SSE3, but shortest/fastest one movss xmm0,[x] orps xmm0,[_1_1_1_0] mulss xmm0,xmm0 movsldup xmm1,xmm0 movddup xmm2,xmm1 mulss xmm0,xmm1 mulps xmm0,xmm2 unfortunately, I got no time testing those _________________ MCD - the inevitable return of the Mad Computer Doggy -||__/ .|+-~ .|| || |
|||
![]() |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.