flat assembler
Message board for the users of flat assembler.

 Index > Main > best way to do....
Author
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 20 Jun 2011, 20:23
What is the best way to do:

( 32 bit unsigned ) * ( 32 bit signed ) -> 64 bit signed

on 32 bit x86?

SSE/MMX welcome
20 Jun 2011, 20:23
bitshifter

Joined: 04 Dec 2007
Posts: 790
Location: Massachusetts, USA
bitshifter 20 Jun 2011, 22:35
can we use the fpu?
Code:
```;data
a dd 1234
b dd -2
c rq 1
;code
fild  [a]
fimul [b]
fistp [c]    ```
20 Jun 2011, 22:35
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 20 Jun 2011, 22:46
hmm that does seem like cheating...
Obviously, the fpu needs to be in extended precision mode.
If this is the case, are all of the bits of c correct for every possible input of a and b?

Edit: at least, the unsigned number needs to be zero extended to 64 bit and then loaded and a 64 bit int.
20 Jun 2011, 22:46
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 20 Jun 2011, 23:21
OK - maybe I should make my motivations clear.
I'm trying to decide between two representations for fixed point multi-precision numbers. (the goal is speed)
1. store the number in 2's complement
2. store the sign and absolute value.

With 1, add/sub are fast and multiply requires these mixed unsigned/signed multiplications
With 2, add/sub becomes a little awkward but all multiplications are unsigned.

Any ideas?
20 Jun 2011, 23:21
JohnFound

Joined: 16 Jun 2003
Posts: 3499
Location: Bulgaria
JohnFound 20 Jun 2011, 23:24
unsigned х abs(signed) (by "mul") instruction will give you 64bit result, that have to be inverted if the signed number is negative.
20 Jun 2011, 23:24
ctl3d32

Joined: 30 Dec 2009
Posts: 204
Location: Brazil
ctl3d32 20 Jun 2011, 23:25
SSE2:

Code:
```format PE GUI 4.0
entry start

include 'win32ax.inc'

section '.text' code readable executable

start:

invoke   GetModuleHandle,0
mov      [h_process],eax
cvtss2sd xmm0,dword[numbers]
cvtss2sd xmm1,dword[numbers+4]
mulpd    xmm0,xmm1
movsd    [result],xmm0
cinvoke  sprintf,caption,fmt,double[result]
invoke   MessageBox,HWND_DESKTOP,caption,title,MB_OK
exit:

invoke  ExitProcess,0

section '.bss' data readable writable

align 16
h_process   dd ? ;handles to the executable process.
result      dq ?
caption     rb 40h

section '.data' data readable writable

align 16
title       db 'SSE2 Multiplication:',0
fmt         db 'The result is %f',0
numbers     dd 1000.0,-5.5

section '.idata' import data readable writeable

library kernel32,'KERNEL32.DLL',\
user32,'USER32.DLL',\
msvcrt,'MSVCRT.DLL'

include 'API\KERNEL32.INC'
include 'API\USER32.INC'

import  msvcrt,\
sprintf,'sprintf'
```
20 Jun 2011, 23:25
LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 21 Jun 2011, 02:23
My implementation of JonhFound's idea:
Code:
```; EAX = unsigned; EDX = signed

; Take EDX sign
mov ecx, edx
sbb ecx, ecx

; Negate EDX if it was negative at entry
xor edx, ecx
sub edx, ecx

mul edx

; Negate result if EDX was negative at entry
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx
; Signed result in EDX:EAX; ECX destroyed    ```
Although I did some testing I'm not fully convinced the implementation is correct for all number pairs, handle with care. Also, please check if this code can be shorten (but without using branching)

The "Take EDX sign" part could be replaced with:
Code:
```mov ecx, edx
sar ecx, 31    ```
But I'm unsure if it would actually be faster.

Yet another method:
Code:
```; Take EDX sign
xor ecx, ecx
cmp edx, \$8000'0000
adc ecx, -1    ```
This one would allow the CPU to execute the first two instructions in parallel, but due to the longer encoding, I'm not sure of the overall impact in the entire code.[/edit]
21 Jun 2011, 02:23
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 21 Jun 2011, 07:47
After testing LocoDelAssembly implementation of JonhFound's idea, I found that
Code:
```        mov  rax,qword[a]    ; signed
mov  rcx,rax
sar  rcx,63             ; sar is 1 cycle faster that add/sbb combo
xor  rax,rcx
sub  rax,rcx
mul  qword[b]        ; unsigned
xor  rax,rcx
xor  rdx,rcx
sub  rax,rcx
sbb  rdx,rcx
mov  qword[c+8*0],rax  ; signed
mov  qword[c+8*1],rdx  ;           ```

is not bad at 5.33 cycles. the 32x32->64 instead of 64x64->128 was actualy SLOWER ????.
While (correct me if this code is not correct)
Code:
```         movd  xmm0,dword[a]    ; signed
movd  xmm1,dword[b]    ; unsigned
movdqa  xmm2,xmm1
pmuludq  xmm1,xmm0
psllq  xmm2,32
psllq  xmm0,32
psubq  xmm2,xmm1
blendvpd  xmm1,xmm2
movq  qword[c],xmm1    ; signed               ```

is 3.33 cycles but only works for 32x32->64

Cycle count is for a loop consiting of just the above instructions.
I think that the conditional negation latency might hold back the multiplications enough that approach #2 might be better.

say we have

Code:
```a_sign dq ?                   ; sign mask for a
a      dq ?,?,?,?             ; absolute value of a stored in 4 legs
b_sign dq ?                   ; dido for b
b      dq ?,?,?,?
c_sign dq ?                   ; and c
c      dq ?,?,?,?                         ```

what tricks can we pull to do c = a + b
21 Jun 2011, 07:47
idle

Joined: 06 Jan 2011
Posts: 439
Location: Ukraine
idle 21 Jun 2011, 14:10
21 Jun 2011, 14:10
Enko

Joined: 03 Apr 2007
Posts: 676
Location: Mar del Plata
Enko 21 Jun 2011, 15:40
idle wrote:
High Level Assembler
http://webster.cs.ucr.edu/

No fasm Iside
21 Jun 2011, 15:40
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 21 Jun 2011, 15:55
OK - but I didn't have to download everything. With the usual 2's complement way of representing integers, it looks like a multiprecision c = a*b is implemented as:

c_sign = 0
if a < 0, a = neg(a); c_sign ^= 1;
if b < 0, b = neg(b); c_sign ^= 1;
c = a *b (unsigned!!)
if c_sign, c = neg(c)

I seriously doubt this is the fastest way
21 Jun 2011, 15:55
r22

Joined: 27 Dec 2004
Posts: 805
r22 21 Jun 2011, 17:29
@LocoDelAssembly
If we trash 1 more register and interleave the instructions we can avoid the partial stall
Code:
```;;; let EAX = signed
;;; let ECX = unsigned
MOV EBX, EAX
CDQ ;;; EDX = EAX.sign ? -1 : 0
NEG EBX ;;; EBX = -EAX, SF = (-EAX).sign ? 1 : 0
XCHG EDX, ECX ;;; ECX = EAX.sign ? -1 : 0
CMOVNS EAX, EBX ;;; EAX = abs(EAX)
...
```

@tthsqe
With x86-64 32bU X 32bS = 64bS is pretty trivial, stop living in the past with your 32bit requirements
Code:
```*** edit *** this code was removed because it apparently used an instruction set that only exists in my imagination.
```

Last edited by r22 on 22 Jun 2011, 02:01; edited 1 time in total
21 Jun 2011, 17:29
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 21 Jun 2011, 20:48
Code:
```MOVZX RDX, dword[a] ;unsigned
MOVSX RAX, dword[b] ;signed
IMUL RDX ; RAX signed     ```

On x84-64, one would hope to accomplish 64bU X 64bS = 128bS. This is what the posted code does.
21 Jun 2011, 20:48
LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 21 Jun 2011, 22:08
r22, "MOVZX RDX, dword[a]" has to be replaced with "MOV EDX, dword [a]" because the former doesn't exist. Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.

Removed comment about your ABS(EAX) code, I was confused.[/edit]
21 Jun 2011, 22:08
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 19876
Location: In your JS exploiting you and your system
revolution 21 Jun 2011, 23:13
LocoDelAssembly wrote:
Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.
Maybe this?
Code:
```mov eax,dword[b]
cdqe    ```
21 Jun 2011, 23:13
LocoDelAssembly
Your code has a bug

Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 21 Jun 2011, 23:32
Yes! I don't know why I thought that the "convert to sign-extended" instructions family always used DX/EDX/RDX as part of the destination.
21 Jun 2011, 23:32
tthsqe

Joined: 20 May 2009
Posts: 767
tthsqe 22 Jun 2011, 01:07
Code:
`Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.     `

There is a movsxd rax,dword[b] command
22 Jun 2011, 01:07
r22

Joined: 27 Dec 2004
Posts: 805
r22 22 Jun 2011, 01:59
LocoDelAssembly wrote:
r22, "MOVZX RDX, dword[a]" has to be replaced with "MOV EDX, dword [a]" because the former doesn't exist. Same problem with "MOVSX RAX, dword[b]" but I don't know of a proper substitute.

Drat! foiled by Intel/AMD again... apparently I've only ever used MOV[S/Z]X with byte sized source operands, so I've managed to keep myself completely ignorant to the instructions' lack of dword mem -> qword encoding.
:oops:
22 Jun 2011, 01:59
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum