How to fill xmmx with 4 singles: x^8, x^6, x^4, x^2

Index > Main > How to fill xmmx with 4 singles: x^8, x^6, x^4, x^2

Author

Thread

mattst88

Joined: 12 May 2006
Posts: 260

mattst88 16 Dec 2006, 03:06

This is for calculating cosine using the Taylor series. I can easily load 1/4! etc from a LUT shamelessly ripped from r22.

Any suggestions how to load x^8, x^6, x^4, and x^2 into an xmm register in any order?

_________________
Assembly Programmer's Journal

16 Dec 2006, 03:06

mattst88

Joined: 12 May 2006
Posts: 260

mattst88 16 Dec 2006, 17:46

I assume after 120 views and no replies this problem isn't simple?

Let me restate the problem:

I want to start with a single-precision floating-point value in [31:0] of xmm0, and with that calculate x^8, x^6, x^4, x^2 and store those in an SSE register.

16 Dec 2006, 17:46

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8534
Location: Kraków, Poland

Tomasz Grysztar 16 Dec 2006, 18:12

OK, my quick try:

Code:

        movss  xmm0,[x]            ; x # # #

        mulss  xmm0,xmm0           ; x^2 # # #
        movss  xmm1,xmm0

        shufps xmm0,xmm0,00111000b ; x^2 # # x^2
        mulss  xmm0,xmm1           ; x^4 # # x^2

        shufps xmm0,xmm0,00111000b ; x^4 # x^2 x^4
        mulss  xmm0,xmm1           ; x^6 # x^2 x^4

        shufps xmm0,xmm0,00111000b ; x^6 x^2 x^4 x^6
        mulss  xmm0,xmm1           ; x^8 x^2 x^4 x^6

        shufps xmm0,xmm0,00111001b ; x^2 x^4 x^6 x^8

Note that three pairs of instructions are identical.

16 Dec 2006, 18:12

mattst88

Joined: 12 May 2006
Posts: 260

mattst88 16 Dec 2006, 18:50

That's what I was thinking, but I was hoping there was a better way than multiplying and shifting. Thanks for responding Smile

16 Dec 2006, 18:50

Tomasz Grysztar

Joined: 16 Jun 2003
Posts: 8534
Location: Kraków, Poland

Tomasz Grysztar 16 Dec 2006, 19:23

Let's hope someone rises to the challenge and shows with a faster version. Wink

16 Dec 2006, 19:23

mattst88

Joined: 12 May 2006
Posts: 260

mattst88 16 Dec 2006, 19:40

Let's put this problem in context. Here's the cosine function I've come up with:

Code:

cosine:
        movss   xmm0,[x]    ; x # # #
        mulss   xmm0,xmm0       ; x^2 # # #
        movss   xmm1,xmm0
        shufps  xmm0,xmm0,00111000b     ; x^2 # # x^2
        mulss   xmm0,xmm1               ; x^4 # # x^2
        shufps  xmm0,xmm0,00111000b     ; x^4 # x^2 x^4
        mulss   xmm0,xmm1               ; x^6 # x^2 x^4
        shufps  xmm0,xmm0,00111000b     ; x^6 x^2 x^4 x^6
        mulss   xmm0,xmm1               ; x^8 x^2 x^4 x^6
        ; shufps xmm0,xmm0,00111001b    ; x^2 x^4 x^6 x^8
        movaps  xmm1,dqword[CosTable]
        mulps   xmm0,xmm1
        haddps  xmm0,xmm0
        haddps  xmm0,xmm0
        movss   xmm2,[one]
        addss   xmm2,xmm0
        ret

align 16
CosTable dd 2.4801587301587301587301587301587e-5,-0.5
         dd 0.041666666666666666666666666666667,-0.0013888888888888888888888888888889
x dd 3.1415926535
one dd 1.0

What optimizations can I make to the total algorithm?

16 Dec 2006, 19:40

kohlrak

Joined: 21 Jul 2006
Posts: 1413
Location: Uncle Sam's Pad

kohlrak 16 Dec 2006, 19:58

I don't think it's possible to make direct binary manipulation (aka doing calculations with the binary operaters rather than numerical operators) any faster. I myself would like to see some one try and tumble with that one. lol

16 Dec 2006, 19:58

Goplat

Joined: 15 Sep 2006
Posts: 181

Goplat 16 Dec 2006, 20:52

This way seems to be a bit faster.

Code:

movss    xmm0,[x]               ;x      0       0       0
mulss    xmm0,xmm0              ;x^2    0       0       0
movss    xmm1,[one]             ;1      0       0       0
movlhps  xmm1,xmm0              ;1      0       x^2     0
mulss    xmm0,xmm0              ;x^4    0       0       0
movsldup xmm1,xmm1              ;1      1       x^2     x^2
movss    xmm1,xmm0              ;x^4    1       x^2     x^2
shufps   xmm0,xmm1,00010000b    ;x^4    x^4     1       x^4
mulps    xmm0,xmm1              ;x^8    x^4     x^2     x^6

Edit: I think this algorithm may be faster without the parallelism. I tried this simple FPU code and it came out to be only 26 cycles compared to the original's 45 and my above version's 37.

Code:

        fld     [x]
        fmul    st,st0
        fld     [c8]
        fmul    st,st1
        fsub    [c6]
        fmul    st,st1
        fadd    [c4]
        fmul    st,st1
        fsub    [c2]
        fmulp   st1,st
        fadd    [one]
...
align 16
c8      dd 2.4801587301587301587301587301587e-5
c6      dd 0.0013888888888888888888888888888889
c4      dd 0.041666666666666666666666666666667
c2      dd 0.5

16 Dec 2006, 20:52

r22

Joined: 27 Dec 2004
Posts: 805

r22 16 Dec 2006, 22:27

Because of the dependencies in the algorithm both ways SSE or FPU will have a lot of stalls during the execution. Interleaving the instructions can help a little.

So for taking an array of theta's and calculating the sin and/or cos for them and store them in another array, you can really take advantage of the parallelism of the SIMD instructions and make use of all the XMMX registers.

Goplat your FPU code doesn't look like it's equivalent to the SSE version, so I don't quite understand your comparison.

16 Dec 2006, 22:27

Goplat

Joined: 15 Sep 2006
Posts: 181

Goplat 16 Dec 2006, 23:11

r22: It is equivalent, just a different approach. I went back to the original formula of cos(x) ~ 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8!

Code:

fld     [c8]    ;1/8!
fmul    st,st1  ;x^2/8!
fsub    [c6]    ;-1/6! + x^2/8!
fmul    st,st1  ;-x^2/6! + x^4/8!
fadd    [c4]    ;1/4! - x^2/6! + x^4/8!
fmul    st,st1  ;x^2/4! - x^4/6! + x^6/8!
fsub    [c2]    ;-1/2! + x^2/4! - x^4/6! + x^6/8!
fmulp   st1,st  ;-x^2/2! + x^4/4! - x^6/6! + x^8/8!
fadd    [one]   ;1 - x^2/2! + x^4/4! - x^6/6! + x^8/8!

16 Dec 2006, 23:11

r22

Joined: 27 Dec 2004
Posts: 805

r22 17 Dec 2006, 19:26

I'm sorry for not elaborating in my last post.

Because you changed the algorithm around, it's not comparible (Like comparing a bubble sort to an insertion sort, but yield the correct results but in totally different ways), unless we rework the SSE code to use the same distributive multiplication method that you used in the FPU code.

If this change you've proposed works well in an SSE translation you may have found a much more efficient way to do parallel calculation of the trig functions (or any function that can be estimated with a summed series). Allowing the distributive law to take care of the increasing exponents, is something I didn't even think of when I first set out to code an SSE taylor series estimate of cosine.

Very good lesson in how code level optimization should occur AFTER algorithm analysis.

17 Dec 2006, 19:26

kohlrak

Joined: 21 Jul 2006
Posts: 1413
Location: Uncle Sam's Pad

kohlrak 17 Dec 2006, 19:34

r22 wrote:

I'm sorry for not elaborating in my last post.

Because you changed the algorithm around, it's not comparible (Like comparing a bubble sort to an insertion sort, but yield the correct results but in totally different ways), unless we rework the SSE code to use the same distributive multiplication method that you used in the FPU code.

If this change you've proposed works well in an SSE translation you may have found a much more efficient way to do parallel calculation of the trig functions (or any function that can be estimated with a summed series). Allowing the distributive law to take care of the increasing exponents, is something I didn't even think of when I first set out to code an SSE taylor series estimate of cosine.

Very good lesson in how code level optimization should occur AFTER algorithm analysis.

Look at it like a redneck. The idea is to get the same answer, faster or with less code. If you get the same answer with a faster code, use it. Now if the idea is to write a code to use those registers and/or instructions as examples, then you would worry about not using this code, but as long as you get the right result in a faster manor and/or with less code, that's what you do, right?

17 Dec 2006, 19:34

mattst88

Joined: 12 May 2006
Posts: 260

mattst88 17 Dec 2006, 20:04

I could be crazy, but Goplat's code doesn't look like it works. Maybe I just don't understand it.

Edit: It works great. It was just over my head. That's actually an excellent piece of code. Awesome work Goplat Smile

Last edited by mattst88 on 17 Dec 2006, 23:40; edited 1 time in total

17 Dec 2006, 20:04

kohlrak

Joined: 21 Jul 2006
Posts: 1413
Location: Uncle Sam's Pad

kohlrak 17 Dec 2006, 20:06

Well, i really don't know what it's supposed to do, but if i did i'd test it and we could easily find out. lol

17 Dec 2006, 20:06

MCD

Joined: 21 Aug 2004
Posts: 602
Location: Germany

MCD 28 Dec 2006, 17:30

okay, my try of the original problem, I got 3 different versions:

Code:

;1.:
        movss   xmm0,[x]
        orps    xmm0,[_1_1_1_0]
        mulss   xmm0,xmm0
        movaps  xmm1,xmm0
        shufps  xmm1,xmm1,11100000b
        movaps  xmm2,xmm1
        shufps  xmm2,xmm2,01000100b
        mulss   xmm0,xmm1
        mulps   xmm0,xmm2

;2:
;maybe punpckldq / qdq got some int/float switching overhead
        movss   xmm0,[x]
        orps    xmm0,[_1_1_1_0]
        mulss   xmm0,xmm0
        movaps  xmm1,xmm0
        punpckldq xmm1,xmm1
        movaps  xmm2,xmm1
        punpcklqdq xmm2,xmm2
        mulss   xmm0,xmm1
        mulps   xmm0,xmm2

;3:
;this requires SSE3, but shortest/fastest one
        movss   xmm0,[x]
        orps    xmm0,[_1_1_1_0]
        mulss   xmm0,xmm0
        movsldup xmm1,xmm0
        movddup xmm2,xmm1
        mulss   xmm0,xmm1
        mulps   xmm0,xmm2

unfortunately, I got no time testing those

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||

28 Dec 2006, 17:30

< Last Thread | Next Thread >

Forum Rules:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum