How to compute x^6 with SSE or AVX?

revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 20753
Location: In your JS exploiting you and your system

revolution 20 Oct 2025, 08:35

It is possible to use fewer multiplies if the code is permitted to include DIV.

For example, n=127 can be done with 7 multiplies and one division (square seven times to get x^128, then divide by x), compared to 12 multiplies if DIV is not used. This can be a win if DIV costs less than 5 multiplies.
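
As a rough illustration of that trade-off, here is a minimal C sketch (not code from this thread) that computes x^127 both ways and counts the operations. The function names and the counter parameters are hypothetical, chosen only for this example. Note that the division variant goes through x^128, so it can overflow or underflow for inputs where x^127 itself would still be representable, and it assumes x is not zero.

```c
#include <stdio.h>

/* x^127 by binary square-and-multiply: 6 squarings + 6 multiplies.
   muls receives the number of multiplications performed. */
static float pow127_mul_only(float x, int *muls)
{
    float result = x;   /* running product, starts at x^1 */
    float sq = x;       /* holds x^(2^k) */
    *muls = 0;
    for (int k = 1; k < 7; ++k) {
        sq *= sq;       ++*muls;   /* x^2, x^4, ..., x^64 */
        result *= sq;   ++*muls;   /* every bit of 127 is set */
    }
    return result;
}

/* x^127 = x^128 / x: 7 squarings + 1 division. Assumes x != 0. */
static float pow127_with_div(float x, int *muls, int *divs)
{
    float p = x;
    *muls = 0;
    for (int k = 0; k < 7; ++k) {
        p *= p;         ++*muls;   /* x^2, x^4, ..., x^128 */
    }
    *divs = 1;
    return p / x;
}
```

Calling both with the same x gives results that agree up to rounding, with counts of 12 multiplies versus 7 multiplies plus 1 division.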

macomics

Joined: 26 Jan 2021
Posts: 1197
Location: Russia

macomics 20 Oct 2025, 08:55

I think there is much more to be gained by dropping the conditional loop and hard-coding the instruction sequence for the required calculation inline.

macomics 20 Oct 2025, 09:32

Let the assembler generate the required sequence of mulss instructions for the specific exponent, with no loops or conditional branches.

Code:

macro int_power_inline p, x, n { local t
  if n = 0
    movss xmm0, [const1]
    movss p, xmm0
  else if n = 1
    movss xmm0, x
    movss p, xmm0
  else if n = -1
    movss xmm0, [const1]
    movss xmm1, x
    divss xmm0, xmm1
    movss p, xmm0
  else
    if n < 0
      t = 0 - n
    else
      t = n
    end if
    movss xmm0, [const1]
    movss xmm1, x
    while t > 1
      if t and 1
        mulss xmm0, xmm1
      end if
      mulss xmm1, xmm1
      t = t shr 1
    end while
    mulss xmm0, xmm1
    if n < 0
      movss xmm1, [const1]
      divss xmm1, xmm0
      movss p, xmm1
    else
      movss p, xmm0
    end if
  end if
}

revolution 20 Oct 2025, 13:31

macomics wrote:

Let the assembler generate the desired sequence of mulss without cycles and conditions, depending on the specific exponent.
<snip>

The code posted generates one more multiply than necessary for n>1.
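
To make the count concrete, the following C sketch (an illustration, not code from the thread) mirrors the assembly-time loop of the first int_power_inline macro and counts the mulss instructions it would emit for n > 1, next to the count for a binary method that never multiplies by 1.0. The helper names are hypothetical.

```c
#include <stdio.h>

/* mulss count of the first int_power_inline macro for n > 1:
   one squaring per loop pass, one extra multiply per set low bit,
   plus the final multiply after the loop. */
static int muls_posted(unsigned n)
{
    int count = 0;
    unsigned t = n;
    while (t > 1) {
        if (t & 1u)
            ++count;        /* mulss xmm0, xmm1 */
        ++count;            /* mulss xmm1, xmm1 */
        t >>= 1;
    }
    return count + 1;       /* final mulss xmm0, xmm1 */
}

/* Count for binary exponentiation without the multiply by 1.0:
   (bit length - 1) squarings plus (set bits - 1) multiplies. */
static int muls_binary(unsigned n)
{
    int bits = 0, ones = 0;
    for (unsigned t = n; t != 0; t >>= 1) {
        ++bits;
        ones += (int)(t & 1u);
    }
    return (bits - 1) + (ones - 1);
}
```

For n = 2 this gives 2 versus 1, and for n = 6 it gives 4 versus 3.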

Roman

Joined: 21 Apr 2012
Posts: 2016

Roman 20 Oct 2025, 15:02

A small optimization of macomics' code:

Code:

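; caller: xmm1 = base, ecx = unsigned exponent
movss xmm1, [float_value]  ; float_value stands for any dword float in memory
mov   ecx, 13
call  xmpow                ; result is returned in xmm0

; xmm0 = xmm1^ecx, clobbers xmm1 and ecx
xmpow:
     movss xmm0, [.const1]
     jecxz .exit           ; x^0 = 1.0
@@:  cmp ecx, 1
     jz .final
     test ecx, 1
     jz .next
      mulss xmm0, xmm1     ; low bit set: multiply into the result
.next:
      mulss xmm1, xmm1     ; square the base for the next bit
      shr ecx, 1
      jmp @b
.final: mulss xmm0, xmm1   ; highest bit of the exponent
.exit: ret

.const1 dd 1.0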

macomics 20 Oct 2025, 15:27

revolution wrote:

macomics wrote:
Let the assembler generate the desired sequence of mulss without cycles and conditions, depending on the specific exponent.
<snip>
The code posted generates one more multiply than necessary for n>1

I checked and everything is ok. It generates what I need.
The last mulss is just for the highest bit in the exponent.

macomics 20 Oct 2025, 15:44

Roman wrote:

little optimization code macomics.

Take a look at the latest changes in my code. There are other optimizations in the loop.

Roman 20 Oct 2025, 18:14

The int_power_inline macro is good, and in some cases it is the best variant.
But sometimes we need to do the calculation with a procedure like xmpow.
I have started writing the physics for my 3D game,
and there I need to use the xmpow procedure.

revolution 20 Oct 2025, 22:56

macomics wrote:

I checked and everything is ok. It generates what I need.
The last mulss is just for the highest bit in the exponent.

I put this:

Code:

const1:
int_power_inline [esp],[esp],2

It generated this:

Code:

        movss xmm0,dword [0x0]
        movss xmm1,dword [esp]
        mulss xmm1,xmm1
        mulss xmm0,xmm1
        movss dword [esp],xmm0

That has two multiplies. But it can be done with one.

Code:

        movss xmm0,dword [esp]
        mulss xmm0,xmm0
        movss dword [esp],xmm0

The "const1" is only required when computing for n<=0. Starting at the highest bit of the exponent and working down allows the code to generate the minimal number of multiplies.
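
The highest-bit-first approach can be sketched in C as follows. This is only a model of the idea, not the macro itself: the function name and the multiply counter are hypothetical, and it assumes n >= 1 (n = 0 and negative exponents would be handled separately, as in the macro).

```c
#include <stdio.h>

/* x^n for n >= 1, scanning the exponent from its highest set bit down.
   The top bit contributes x itself, so no multiply by 1.0 is needed.
   muls receives the number of multiplications performed. */
static float pow_msb_first(float x, unsigned n, int *muls)
{
    unsigned mask = 1u;
    /* Find the highest set bit of n; stop if the shift would wrap to 0. */
    while ((mask << 1) != 0 && (mask << 1) <= n)
        mask <<= 1;

    float r = x;
    *muls = 0;
    for (mask >>= 1; mask != 0; mask >>= 1) {
        r *= r;     ++*muls;        /* square for every lower bit */
        if (n & mask) {
            r *= x; ++*muls;        /* extra multiply for a set bit */
        }
    }
    return r;
}
```

With n = 2 the loop body runs once and performs a single multiply, matching the three-instruction sequence above; with n = 6 it performs three.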

macomics 21 Oct 2025, 00:21

I agree. We can get rid of the multiplication by 1.

Code:

macro int_power_inline p, x, n { local f, t
  if n = 0
    movss xmm0, [const1]
    movss p, xmm0
  else if n = 1
    movss xmm0, x
    movss p, xmm0
  else if n = -1
    movss xmm0, [const1]
    movss xmm1, x
    divss xmm0, xmm1
    movss p, xmm0
  else
    f = 0
    if n < 0
      t = 0 - n
    else
      t = n
    end if
    movss xmm1, x
    while t > 1
      if t and 1
        if f = 0
          f = 1
          movss xmm0, xmm1
        else
          mulss xmm0, xmm1
        end if
      end if
      mulss xmm1, xmm1
      t = t shr 1
    end while
    if f = 0
      if n < 0
        movss xmm0, [const1]
        divss xmm0, xmm1
        movss p, xmm0
      else
        movss p, xmm1
      end if
    else
      mulss xmm0, xmm1
      if n < 0
        movss xmm1, [const1]
        divss xmm1, xmm0
        movss p, xmm1
      else
        movss p, xmm0
      end if
    end if
  end if
}

revolution 21 Oct 2025, 00:34

Using n as a raw value in "t = 0 - n" goes wrong when the argument is an expression, like this:

Code:

int_power_inline [esp],[esp],-1-1

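The argument is substituted as text, so the line expands to "t = 0 - -1-1" instead of negating the whole expression. The n needs to be put inside parentheses: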

Code:

t = 0 - (n)
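
The same pitfall exists in the C preprocessor, which also substitutes arguments without regard to operator precedence. The sketch below is only a C analogue and says nothing about how fasm evaluates the expression. The macro names are hypothetical.

```c
#include <stdio.h>

/* Argument used raw, like "t = 0 - n" in the macro. */
#define NEG_RAW(n)   (0 - n)

/* Argument parenthesized, like "t = 0 - (n)". */
#define NEG_SAFE(n)  (0 - (n))

/* NEG_RAW(-1-1) expands to (0 - -1-1), which C evaluates as
   (0 - (-1)) - 1. NEG_SAFE(-1-1) expands to (0 - (-1-1)). */
static int neg_raw_of_minus_two(void)  { return NEG_RAW(-1-1); }
static int neg_safe_of_minus_two(void) { return NEG_SAFE(-1-1); }
```

In C the raw form yields 0 and the parenthesized form yields 2, which is the magnitude the macro wants for an exponent of -2.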

bitRAKE

Joined: 21 Jul 2003
Posts: 4307
Location: vpcmpistri

bitRAKE 21 Oct 2025, 00:34

What about something like:

Code:

; input: XMM0, ECX
; output XMM0 <- XMM0^ECX

; First, handle special cases: 0, 1 and powers of two:

.reduce:
        shr ecx, 1
        jz .done
        jc .first_bit
        vmulss  xmm0, xmm0, xmm0
        jmp .reduce

.first_bit:
        vmovss xmm1, xmm0
; perhaps replace above, for no merge dependency from prior xmm1
;       vmovaps xmm1, xmm0

; Second hot loop to complete calculation:

.square:
        vmulss xmm1, xmm1, xmm1
        shr ecx, 1
        jnc .square
        vmulss xmm0, xmm0, xmm1
        jnz .square
.not_zero:
        retn

.done:  jc .not_zero ; predict backward likely jump
        mov ecx, 1f ; 0x3f800000
        vmovd xmm0, ecx
        retn

; + No memory access.

edit: ... test/timing of algorithm.
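
For readers who want to follow the control flow of the routine above without tracing flags, here is a rough C model of its two phases. It is a sketch written for this note, not a translation verified against the assembly, and the function name is hypothetical.

```c
#include <stdio.h>

/* Model of the two-phase scheme: first square the base while the low
   exponent bit is clear, then run an ordinary square-and-multiply loop
   seeded with that base, so no 1.0 constant is needed for n >= 1. */
static float pow_two_phase(float x, unsigned n)
{
    if (n == 0)
        return 1.0f;            /* only case that needs the constant */

    /* Phase 1: strip trailing zero bits, squaring the base each time. */
    while ((n & 1u) == 0) {
        x *= x;
        n >>= 1;
    }
    n >>= 1;                    /* consume the lowest set bit */

    /* Phase 2: result starts as the current base. */
    float r = x;
    float sq = x;
    while (n != 0) {
        sq *= sq;               /* next power-of-two power of the base */
        if (n & 1u)
            r *= sq;
        n >>= 1;
    }
    return r;
}
```

For a power of two such as n = 4 the second loop never runs, and for n = 6 the work is one squaring in phase 1 followed by one squaring and one multiply in phase 2.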
