flat assembler
Message board for the users of flat assembler.

Emulating FMA4 with FMA and vice versa

Tomasz Grysztar 14 Apr 2012, 20:32
Since AMD implemented FMA4 while Intel chose the 3-operand FMA, we now have two sets of instructions that do essentially the same thing but differ in syntax and encoding. They are still similar enough that one set can be emulated with the other, using a few simple macroinstructions to do the conversion. I decided to give it a try.

FMA4 is more flexible than FMA, so every FMA instruction can be encoded as an FMA4 equivalent. The other direction is possible only when the destination of the FMA4 instruction is the same register as one of its three source operands.

So emulation of FMA with FMA4 instructions is very simple and should work for every valid FMA syntax:
Code:
irps m, m nm { irps a, add sub addsub subadd \{ irps s, ps pd ss sd \\{
  macro vf#m\#a\\#132\\#s dest,src1,src2 \\\{ vf#m\#a\\#s dest,dest,src2,src1 \\\}
  macro vf#m\#a\\#213\\#s dest,src1,src2 \\\{ vf#m\#a\\#s dest,dest,src1,src2 \\\}
  macro vf#m\#a\\#231\\#s dest,src1,src2 \\\{ vf#m\#a\\#s dest,src1,src2,dest \\\}
\\} \} }


; example conversions:

vfmsub231ps  ymm1,ymm2,ymm3   ; vfmsubps  ymm1,ymm2,ymm3,ymm1
vfnmadd132sd xmm0,xmm5,[ebx]  ; vfnmaddsd xmm0,xmm0,[ebx],xmm5
vfmadd213pd  ymm0,ymm1,[esi]  ; vfmaddpd  ymm0,ymm0,ymm1,[esi]      
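
The operand shuffling in these three macros can be sanity-checked outside the assembler. What follows is a minimal Python sketch, not fasm code. It assumes the usual reading of the FMA digits (in vfmadd132, operand 1 is multiplied by operand 3 and operand 2 is added) and models registers as plain integers. The names fma3, fma4, REWRITE and check are made up for this illustration.

```python
# Toy model: a register file is a dict of integers, and each function
# overwrites its first operand. Vector width, rounding and the
# sub/addsub/subadd/negated variants are deliberately left out.

def fma3(order, regs, op1, op2, op3):
    """Model vfmadd<order> op1,op2,op3.

    The three digits name the operands used as multiplicand,
    multiplier and addend, in that order."""
    ops = {"1": regs[op1], "2": regs[op2], "3": regs[op3]}
    x, y, z = (ops[d] for d in order)
    regs[op1] = x * y + z

def fma4(regs, dest, src1, src2, src3):
    """Model vfmadd dest,src1,src2,src3: dest = src1*src2 + src3."""
    regs[dest] = regs[src1] * regs[src2] + regs[src3]

# FMA form -> FMA4 operand order, copied from the three macro bodies.
REWRITE = {
    "132": lambda d, s1, s2: (d, d, s2, s1),
    "213": lambda d, s1, s2: (d, d, s1, s2),
    "231": lambda d, s1, s2: (d, s1, s2, d),
}

def check(order):
    """Run the FMA form and its FMA4 rewrite on equal inputs and compare."""
    start = {"r0": 2, "r1": 3, "r2": 5}
    a, b = dict(start), dict(start)
    fma3(order, a, "r0", "r1", "r2")
    fma4(b, *REWRITE[order]("r0", "r1", "r2"))
    return a == b

results = {order: check(order) for order in REWRITE}
print(results)
```

With the starting values chosen here the three forms give three different results, so a wrong operand order in REWRITE would show up as a False entry.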


Emulation of FMA4 with FMA instructions is a bit harder: the FMA4 instruction must use its destination register as one of the sources as well, otherwise the simple conversion is not possible:
Code:
irps m, m nm { irps a, add sub addsub subadd \{ irps s, ps pd ss sd \\{

  macro vf#m\#a\\#s dest,src1,src2,src3 \\\{
    if dest eq src3
      vf#m\#a\\#231\\#s dest,src1,src2
    else if dest eq src2
      vf#m\#a\\#213\\#s dest,src1,src3
    else if dest eq src1
      if src2 eqtype [si] | src2 eqtype byte[si]
        vf#m\#a\\#132\\#s dest,src3,src2
      else
        vf#m\#a\\#213\\#s dest,src2,src3
      end if
    else
      err; not encodable
    end if
  \\\}

\\} \} }


; example conversions:

vfmaddpd  ymm0,ymm0,[esi],ymm2  ; vfmadd132pd  ymm0,ymm2,[esi]
vfmsubpd  ymm0,ymm1,[esi],ymm0  ; vfmsub231pd  ymm0,ymm1,[esi]
vfnmaddps ymm0,ymm1,ymm0,[esi]  ; vfnmadd213ps ymm0,ymm1,[esi]
vfmsubsd  xmm0,xmm0,xmm1,[esi]  ; vfmsub213sd  xmm0,xmm1,[esi]      
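
The selection logic of this macro can also be traced in a few lines of Python. This is only an illustrative sketch, not part of the macro package: the function names are hypothetical, operands are compared as text (much like `eq` does), and an operand is assumed to be a memory reference whenever it contains `[`.

```python
def is_mem(op):
    """Crude stand-in for the eqtype test: memory operands contain '['."""
    return "[" in op

def fma4_to_fma3(mnemonic, dest, src1, src2, src3):
    """Return the equivalent FMA instruction as text, or None.

    mnemonic is an FMA4 name such as 'vfmaddpd'; the last two
    characters are taken to be the ps/pd/ss/sd suffix."""
    stem, suffix = mnemonic[:-2], mnemonic[-2:]
    if dest == src3:                      # dest = src1*src2 + dest
        order, ops = "231", (dest, src1, src2)
    elif dest == src2:                    # dest = src1*dest + src3
        order, ops = "213", (dest, src1, src3)
    elif dest == src1:                    # dest = dest*src2 + src3
        if is_mem(src2):
            # memory may only be the last FMA operand, so use 132
            order, ops = "132", (dest, src3, src2)
        else:
            order, ops = "213", (dest, src2, src3)
    else:
        return None                       # needs an extra move
    return f"{stem}{order}{suffix} " + ",".join(ops)

examples = [
    ("vfmaddpd",  "ymm0", "ymm0", "[esi]", "ymm2"),
    ("vfmsubpd",  "ymm0", "ymm1", "[esi]", "ymm0"),
    ("vfnmaddps", "ymm0", "ymm1", "ymm0",  "[esi]"),
    ("vfmsubsd",  "xmm0", "xmm0", "xmm1",  "[esi]"),
]
for ex in examples:
    print(fma4_to_fma3(*ex))
```

The four calls use the same operands as the example conversions above, so the printed lines can be compared against the comments there.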

But if we allow an additional move instruction to be generated, then complete emulation should be possible:
Code:
irps m, m nm { irps a, add sub addsub subadd \{ irps s, ps pd ss sd \\{

  macro vf#m\#a\\#s dest,src1,src2,src3 \\\{
    if dest eq src3
      vf#m\#a\\#231\\#s dest,src1,src2
    else if dest eq src2
      vf#m\#a\\#213\\#s dest,src1,src3
    else if dest eq src1
      if src2 eqtype [si] | src2 eqtype byte[si]
        vf#m\#a\\#132\\#s dest,src3,src2
      else
        vf#m\#a\\#213\\#s dest,src2,src3
      end if
    else
      if src3 eqtype [si] | src3 eqtype byte[si]
        if s eq ps | s eq pd
          vmova\\#s dest,src1
        else
          vmov\\#s dest,dest,src1
        end if
        vf#m\#a\\#213\\#s dest,src2,src3
      else if src1 eqtype [si] | src1 eqtype byte[si]
        if s eq ps | s eq pd
          vmova\\#s dest,src2
        else
          vmov\\#s dest,dest,src2
        end if
        vf#m\#a\\#132\\#s dest,src3,src1
      else
        if s eq ps | s eq pd
          vmova\\#s dest,src3
        else
          vmov\\#s dest,dest,src3
        end if
        vf#m\#a\\#231\\#s dest,src1,src2
      end if
    end if
  \\\}

\\} \} }

; example conversions:

vfmaddpd ymm0,ymm1,[esi],ymm2  ; vmovapd ymm0,ymm2
                               ; vfmadd231pd ymm0,ymm1,[esi]

vfmsubss xmm0,xmm1,xmm2,[ebx]  ; vmovss xmm0,xmm0,xmm1
                               ; vfmsub213ss xmm0,xmm2,dword [ebx]     
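
The fallback branch, where the destination differs from all three sources, can be mirrored in Python in the same spirit. Again this is a hedged sketch with hypothetical names (is_mem, emulate_with_move); it only builds instruction text and assumes a memory operand is anything containing `[`.

```python
def is_mem(op):
    """Crude stand-in for the eqtype test: memory operands contain '['."""
    return "[" in op

def emulate_with_move(mnemonic, dest, src1, src2, src3):
    """Two-instruction FMA sequence for an FMA4 instruction whose
    dest is different from src1, src2 and src3."""
    stem, suffix = mnemonic[:-2], mnemonic[-2:]

    def move(src):
        # packed forms copy the whole register; scalar forms use the
        # three-operand vmovss/vmovsd, as the macro does
        if suffix in ("ps", "pd"):
            return f"vmova{suffix} {dest},{src}"
        return f"vmov{suffix} {dest},{dest},{src}"

    if is_mem(src3):
        # dest = src1, then dest = src2*dest + src3
        return [move(src1), f"{stem}213{suffix} {dest},{src2},{src3}"]
    if is_mem(src1):
        # dest = src2, then dest = dest*src1 + src3
        return [move(src2), f"{stem}132{suffix} {dest},{src3},{src1}"]
    # dest = src3, then dest = src1*src2 + dest
    return [move(src3), f"{stem}231{suffix} {dest},{src1},{src2}"]

for line in emulate_with_move("vfmaddpd", "ymm0", "ymm1", "[esi]", "ymm2"):
    print(line)
for line in emulate_with_move("vfmsubss", "xmm0", "xmm1", "xmm2", "[ebx]"):
    print(line)
```

In every branch the register operand is the one that gets copied, so the memory operand always ends up as the last operand of the FMA instruction, the only position where FMA accepts memory.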


Note that I'm just playing with macros here - I do not claim that this is the optimal way to do such emulation nor that it is a good idea at all. ;)
rugxulo 16 Apr 2012, 05:14
(quick search) So this is actually implemented in available (!) hardware with AMD Bulldozer as of last October (2011)?? Heh, finally, something that isn't vaporware, sheesh.

(earlier search today) Seems that Ivy Bridge will finally ship later this month, too.

Anyways, kudos on the macros (though it seems there are new x86 extensions every week now, sheesh).