flat assembler
Message board for the users of flat assembler.

Index > Main > Tricky assembler task

Author
Thread Post new topic Reply to topic
Super64



Joined: 03 May 2005
Posts: 2
Super64
Hi! I need a very elegant solution, but cannot make it by myself that's why i'm here:

We have two 64-bit registers RAX and RDX:
RAX: |0|1|0|1|1|0|0|1| -- correspond RSI
RDX: |0|1|1|1|0|0|0|0| -- correspond RDI
From theese two i need choose the pair RAX:RSI or RDX:RDI with max number of zero bytes in the right side in RAX or RDX. i.e. RDX

1st way
BSWAP RAX
BSWAP RDX
CMP RAX,RDX
CMOVB RDX,RAX
CMOVB RDI,RSI

I don't like this way because of additional bswaps...
Maybe there's another way and I just have missed it?

Thank you for your attention and consideration.
Post 03 May 2005, 09:37
View user's profile Send private message Reply with quote
PopeInnocent



Joined: 01 Jan 2004
Posts: 18
Location: USA
PopeInnocent
Looks like you destroy RAX and RDX, so you'd need one more bswap at the end of the code sequence.

What about:

bsr rcx,rax
bsr rbx,rdx
cmp rbx,rcx
cmovb rdx,rax
cmovb rdi,rsi

(Untested, but ought to work Smile

IIRC, bsr finds the least significant set bit. This should be the same as your sample code as long as every byte is either 1 or 0. But just looking at it, I can't say whether my code is any faster.
Post 03 May 2005, 23:30
View user's profile Send private message Reply with quote
Super64



Joined: 03 May 2005
Posts: 2
Super64
PopeInnocent wrote:
Looks like you destroy RAX and RDX, so you'd need one more bswap at the end of the code sequence.

What about:

bsr rcx,rax
bsr rbx,rdx
cmp rbx,rcx
cmovb rdx,rax
cmovb rdi,rsi

(Untested, but ought to work Smile

IIRC, bsr finds the least significant set bit. This should be the same as your sample code as long as every byte is either 1 or 0. But just looking at it, I can't say whether my code is any faster.


NOWAY! BSR(BSF) is a vectorpath instruction. Normally AMD64 processor can dispatch and execute 3(!) simple instruction (directpath) at each clock cycle. The vectorpath instructions block decoding unit for 1 clock cycle and the inner sequencer puts a lot of instructions corresponding to that particular vectorpath (complex) instruction. IN CASE OF BSR/BSF it will take 9-12 clock cycles just to complete that complex instruction. In 9 clock cycles you can complete up to 27(!) simple directpath instructions.

Actually I slightly modified that task and found blazingly fast solution.
I set up a check for at least 3 right zeroes. No need to bother with counting right zeroes if we have not enough of them...

XORQ (RSI),RAX
JZ VERYRARECASEJUMP
TESTL $0xFFFFFF,EAX
JZ RARECASEJUMP

so i've effectively skipped bad case behaviour. If there are at least 3 zeroes it will jump to counting handler.

But if you have 50% chance JUMP/NOT JUMP it won't do fast for sure. branch prediction won't work at random data.

Thank YOU, byebye!
Post 04 May 2005, 06:14
View user's profile Send private message Reply with quote
UCM



Joined: 25 Feb 2005
Posts: 285
Location: Canada
UCM
whoa that was rude. Wink

_________________
This calls for... Ultra CRUNCHY Man!
Ta da!! *crunch*
Post 07 May 2005, 00:04
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.