flat assembler
Message board for the users of flat assembler.

Index > Main > best way to do....

Author
Thread Post new topic Reply to topic
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
What is the best way to do:

( 32 bit unsigned ) * ( 32 bit signed ) -> 64 bit signed

on 32 bit x86?

SSE/MMX welcome
Post 20 Jun 2011, 20:23
View user's profile Send private message Reply with quote
bitshifter



Joined: 04 Dec 2007
Posts: 764
Location: Massachusetts, USA
bitshifter
can we use the fpu?
Code:
;data
a dd 1234
b dd -2
c rq 1
;code
fild  [a]
fimul [b]
fistp [c]    
Post 20 Jun 2011, 22:35
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
hmm that does seem like cheating...
Obviously, the fpu needs to be in extended precision mode.
If this is the case, are all of the bits of c correct for every possible input of a and b?

Edit: at least, the unsigned number needs to be zero extended to 64 bit and then loaded and a 64 bit int.
Post 20 Jun 2011, 22:46
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
OK - maybe I should make my motivations clear.
I'm trying to decide between two representations for fixed point multi-precision numbers. (the goal is speed)
1. store the number in 2's complement
2. store the sign and absolute value.

With 1, add/sub are fast and multiply requires these mixed unsigned/signed multiplications
With 2, add/sub becomes a little awkward but all multiplications are unsigned.

Any ideas?
Post 20 Jun 2011, 23:21
View user's profile Send private message Reply with quote
JohnFound



Joined: 16 Jun 2003
Posts: 3502
Location: Bulgaria
JohnFound
unsigned х abs(signed) (by "mul") instruction will give you 64bit result, that have to be inverted if the signed number is negative.
Post 20 Jun 2011, 23:24
View user's profile Send private message Visit poster's website ICQ Number Reply with quote
ctl3d32



Joined: 30 Dec 2009
Posts: 204
Location: Brazil
ctl3d32
SSE2:

Code:
format PE GUI 4.0
entry start

include 'win32ax.inc'

section '.text' code readable executable

  start:

        invoke   GetModuleHandle,0
        mov      [h_process],eax
        cvtss2sd xmm0,dword[numbers]
        cvtss2sd xmm1,dword[numbers+4]
        mulpd    xmm0,xmm1
        movsd    [result],xmm0
        cinvoke  sprintf,caption,fmt,double[result]
        invoke   MessageBox,HWND_DESKTOP,caption,title,MB_OK
  exit:

        invoke  ExitProcess,0

section '.bss' data readable writable

  align 16
  h_process   dd ? ;handles to the executable process.
  result      dq ?
  caption     rb 40h

section '.data' data readable writable

  align 16
  title       db 'SSE2 Multiplication:',0
  fmt         db 'The result is %f',0
  numbers     dd 1000.0,-5.5

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          msvcrt,'MSVCRT.DLL'

  include 'API\KERNEL32.INC'
  include 'API\USER32.INC'

  import  msvcrt,\
          sprintf,'sprintf'
    
Post 20 Jun 2011, 23:25
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
My implementation of JonhFound's idea:
Code:
; EAX = unsigned; EDX = signed

; Take EDX sign
mov ecx, edx
add ecx, ecx
sbb ecx, ecx

; Negate EDX if it was negative at entry
xor edx, ecx
sub edx, ecx

mul edx

; Negate result if EDX was negative at entry
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx
; Signed result in EDX:EAX; ECX destroyed    
Although I did some testing I'm not fully convinced the implementation is correct for all number pairs, handle with care. Also, please check if this code can be shorten (but without using branching)

The "Take EDX sign" part could be replaced with:
Code:
mov ecx, edx
sar ecx, 31    
But I'm unsure if it would actually be faster.

Yet another method:
Code:
; Take EDX sign
xor ecx, ecx
cmp edx, $8000'0000
adc ecx, -1    
This one would allow the CPU to execute the first two instructions in parallel, but due to the longer encoding, I'm not sure of the overall impact in the entire code.[/edit]
Post 21 Jun 2011, 02:23
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
After testing LocoDelAssembly implementation of JonhFound's idea, I found that
Code:
        mov  rax,qword[a]    ; signed
        mov  rcx,rax
        sar  rcx,63             ; sar is 1 cycle faster that add/sbb combo
        xor  rax,rcx
        sub  rax,rcx
        mul  qword[b]        ; unsigned
        xor  rax,rcx
        xor  rdx,rcx
        sub  rax,rcx
        sbb  rdx,rcx
        mov  qword[c+8*0],rax  ; signed
        mov  qword[c+8*1],rdx  ;           

is not bad at 5.33 cycles. the 32x32->64 instead of 64x64->128 was actualy SLOWER ????.
While (correct me if this code is not correct)
Code:
         movd  xmm0,dword[a]    ; signed
         movd  xmm1,dword[b]    ; unsigned
       movdqa  xmm2,xmm1
      pmuludq  xmm1,xmm0
        psllq  xmm2,32
        psllq  xmm0,32
        psubq  xmm2,xmm1
     blendvpd  xmm1,xmm2
         movq  qword[c],xmm1    ; signed               

is 3.33 cycles but only works for 32x32->64

Cycle count is for a loop consiting of just the above instructions.
I think that the conditional negation latency might hold back the multiplications enough that approach #2 might be better.

New question: how about this:
say we have

Code:
a_sign dq ?                   ; sign mask for a
a      dq ?,?,?,?             ; absolute value of a stored in 4 legs
b_sign dq ?                   ; dido for b
b      dq ?,?,?,?
c_sign dq ?                   ; and c
c      dq ?,?,?,?                         


what tricks can we pull to do c = a + b
Post 21 Jun 2011, 07:47
View user's profile Send private message Reply with quote
idle



Joined: 06 Jan 2011
Posts: 360
Location: Ukraine
idle
Post 21 Jun 2011, 14:10
View user's profile Send private message Reply with quote
Enko



Joined: 03 Apr 2007
Posts: 678
Location: Mar del Plata
Enko
idle wrote:
High Level Assembler
http://webster.cs.ucr.edu/
ide: http://sites.google.com/site/highlevelassembly/downloads/hide

downloading everything Smile

No fasm Iside Shocked
Post 21 Jun 2011, 15:40
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
OK - but I didn't have to download everything. With the usual 2's complement way of representing integers, it looks like a multiprecision c = a*b is implemented as:

c_sign = 0
if a < 0, a = neg(a); c_sign ^= 1;
if b < 0, b = neg(b); c_sign ^= 1;
c = a *b (unsigned!!)
if c_sign, c = neg(c)

I seriously doubt this is the fastest way
Post 21 Jun 2011, 15:55
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
@LocoDelAssembly
If we trash 1 more register and interleave the instructions we can avoid the partial stall
Code:
;;; let EAX = signed
;;; let ECX = unsigned
MOV EBX, EAX
CDQ ;;; EDX = EAX.sign ? -1 : 0
NEG EBX ;;; EBX = -EAX, SF = (-EAX).sign ? 1 : 0
XCHG EDX, ECX ;;; ECX = EAX.sign ? -1 : 0
CMOVNS EAX, EBX ;;; EAX = abs(EAX)
...
    


@tthsqe
With x86-64 32bU X 32bS = 64bS is pretty trivial, stop living in the past with your 32bit requirements Very Happy Very Happy
Code:
*** edit *** this code was removed because it apparently used an instruction set that only exists in my imagination.
    


Last edited by r22 on 22 Jun 2011, 02:01; edited 1 time in total
Post 21 Jun 2011, 17:29
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
Code:
MOVZX RDX, dword[a] ;unsigned 
MOVSX RAX, dword[b] ;signed 
IMUL RDX ; RAX signed     

On x84-64, one would hope to accomplish 64bU X 64bS = 128bS. This is what the posted code does.
Post 21 Jun 2011, 20:48
View user's profile Send private message Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
r22, "MOVZX RDX, dword[a]" has to be replaced with "MOV EDX, dword [a]" because the former doesn't exist. Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.

[edit]Removed comment about your ABS(EAX) code, I was confused.[/edit]
Post 21 Jun 2011, 22:08
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17716
Location: In your JS exploiting you and your system
revolution
LocoDelAssembly wrote:
Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.
Maybe this?
Code:
mov eax,dword[b]
cdqe    
Post 21 Jun 2011, 23:13
View user's profile Send private message Visit poster's website Reply with quote
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Yes! I don't know why I thought that the "convert to sign-extended" instructions family always used DX/EDX/RDX as part of the destination.
Post 21 Jun 2011, 23:32
View user's profile Send private message Reply with quote
tthsqe



Joined: 20 May 2009
Posts: 730
tthsqe
Code:
Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.     


There is a movsxd rax,dword[b] command
Post 22 Jun 2011, 01:07
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
LocoDelAssembly wrote:
r22, "MOVZX RDX, dword[a]" has to be replaced with "MOV EDX, dword [a]" because the former doesn't exist. Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.

Drat! foiled by Intel/AMD again... apparently I've only ever used MOV[S/Z]X with byte sized source operands, so I've managed to keep myself completely ignorant to the instructions' lack of dword mem -> qword encoding.
:oops:
Post 22 Jun 2011, 01:59
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.