flat assembler
Message board for the users of flat assembler.

Index > Main > Optimization challenge (big (?) deal with small bits ...)

Author
Thread Post new topic Reply to topic
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 08 Aug 2007, 05:39
Following problem:

Have an UINT32 containing 8 interesting bits only:

>> JJ B7 JJ JJ | JJ B6 JJ JJ | JJ B5 JJ JJ | JJ B4 JJ JJ | JJ B3 JJ JJ | JJ B2 JJ JJ | JJ B1 JJ JJ | JJ B0 JJ JJ <<

And want to extract them and put in one byte:

>> B7 B6 B5 B4 B3 B2 B1 B0 <<

What is the fastest way to do ? I can move every single bit separately, the solution results in 16 instructions. Anyone knows a faster way (80386 code, no MMX/SSSSE) ? Smile

_________________
Bug Nr.: 12345

Title: Hello World program compiles to 100 KB !!!

Status: Closed: NOT a Bug
Post 08 Aug 2007, 05:39
View user's profile Send private message Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 08 Aug 2007, 06:35
;Assume input in eax.
and eax, 44444444h
lea eax, [eax+8*eax]
and eax 60606060h
#if you have a fast multiplier
mov edx, 08208200h
mul edx ; output in dl
#else
lea edx, [8*eax]
shr eax, 3
or edx, eax
mov eax, edx
shr edx, 12
or eax, edx ; output in ah
Post 08 Aug 2007, 06:35
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 08 Aug 2007, 12:13
Code tags please and nicer spacing: Smile
Code:
FASTMUL=1
    mov eax,0F7264CD8h ;Your data in eax

    and eax, 44444444h
    lea eax, [9*eax]
    and eax, 60606060h
    if FASTMUL=1
      mov edx,08208200h
      mul edx ; output in dl
    else
      lea edx, [8*eax]
      shr eax, 3
      or  edx, eax
      mov eax, edx
      shr edx, 12
      or  eax, edx ; output in ah
    end if
    


@Xorpd! Btw, how did you come up with that magic value? mov edx,08208200h
Post 08 Aug 2007, 12:13
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 08 Aug 2007, 12:36
Thanks.

Code:
#if you have a fast multiplier
mov edx, 08208200h
mul edx ; output in dl 
    


MUL costs 38 $ ... oops 38 cycles (!!!) Shocked ... forget it Sad while "ordinary" instructions (incl. sophisticated LEA (??) Shocked ) cost only 2...4 - MUL would definitely pessimise my code Shocked

OTOH the no-MUL code looks promising ... should be almost 2x faster ... if it gives good results Smile I'll test ...

Code:
   ;input in EAX
   and eax, $44444444  ; 1. Zeroize out the junk
   lea eax, [9*eax]    ; 2. * 9 || SHL 3 followed by OR, drifted by 3 bits here
   and eax, $60606060  ; 3. Get rid of unuseful duplicates, $F0F0F0F0 would do also
   lea edx, [8*eax]    ; 4. EAX * 8 here || SHL 3
   shr eax, 3          ; 5. EAX / 8 here  || SHR 3
   or  edx, eax        ; 6. Put together, drifted by 6 bits
   mov eax, edx        ; 7. EAX<-EDX
   shr edx, 12         ; 8. Divide by 4096 || SHR 12
   or  eax, edx        ; 9. Put together, drifted by 12 bits here
   ;result in AH

0. A |J B7 J J | J B6 J J | J B5 J J | J B4 J J | J B3 J J | J B2 J J | J B1 J J | J B0 J J |
1. A |0 B7 0 0 | 0 B6 0 0 | 0 B5 0 0 | 0 B4 0 0 | 0 B3 0 0 | 0 B2 0 0 | 0 B1 0 0 | 0 B0 0 0 |
2. A |0 B7 B6 0| 0 B6 B5 0| 0 B5 B4 0| 0 B4 B3 0| 0 B3 B2 0| 0 B2 B1 0| 0 B1 B0 0| 0 B0 0 0 |
   & |0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0 0 |
3. A |0 B7 B6 0| 0 0  0  0| 0 B5 B4 0| 0 0  0  0| 0 B3 B2 0| 0 0  0  0| 0 B1 B0 0| 0 0  0 0 |
4. D |0 0 0  0 | 0 0 B5 B4| 0 0 0  0 | 0 0 B3 B2| 0 0 0  0 | 0 0 B1 B0| 0 0 0  0 | 0 0  0 0 |
5. A |0 0 0  0 | B7 B6 0 0| 0  0  0 0| B5 B4 0 0| 0  0  0 0| B3 B2 0 0| 0  0  0 0| B1 B0 0 0|
   +
6. D |0 0 0  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0| B3B2 B1B0| 0  0  0 0| B1 B0 0 0|
7. A |0 0 0  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0| B3B2 B1B0| 0  0  0 0| B1 B0 0 0|
8. D |0 0 0  0 |0 0 0  0  | 0 0 0  0 | 0 0 0  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0|
   +
9. A |0 0 0  0 | B7B6 B5B4| 0  0 0 0 | B5B4 B3B2| B7B6 B5B4| B3B2 B1B0| B5B4 B3B2| B1 B0 0 0|
     |  junk   |   junk   |   junk   |   junk   |   ===   AH   ===    |   junk   |  junk   |
    


Seems to perfectly work in my "simulation" Smile

20 cycles, 2 of 32-bit registers trashed

Other ideas still welcome Wink


Last edited by DOS386 on 14 Aug 2007, 06:34; edited 3 times in total
Post 08 Aug 2007, 12:36
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 08 Aug 2007, 13:27
If memory isn't an issue you may want to try a lookup table
Fill the LUTLO and LUTHI with the correct returned values for every possible WORD sized bit stream and your all set. It will use 65536 * 2 bytes though.

movzx edx,ax
movzx ecx,byte[LUTLO+edx]
shr eax,16
movzx edx,byte[LUTHI+eax]
or ecx,edx
Post 08 Aug 2007, 13:27
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 08 Aug 2007, 13:38
> If memory isn't an issue you may want to try a lookup table
> It will use 65536 * 2 bytes though.

Costs 128 KiB ? No problem ... if it speeds up Idea

Code:
  ; input in EAX
  movzx edx,ax  ; Separate 16 lower bits
  movzx ecx,byte[LUTLO+edx]
  shr eax,16    ; Separate 16 higher bits
  movzx edx,byte[LUTHI+eax]
  or ecx,edx
  ; result in CL 
    


20 cycles, 3 32-bit registers trashed

Will test Wink
Post 08 Aug 2007, 13:38
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 09 Aug 2007, 07:32
MUL costs 38 $ ... oops 38 cycles (!!!) Shocked ...

Erm, what kind of CPU you're using. My MULs take 3 fused microops+5 latency which is 6 clocks at max (Core 2).

Pentium and Pentium MMX take 9 nonpairable (full) clocks.
Penritum Pro/II/III take 1 clock+4 latency which is 5 clocks at max.
Pentium Mobility takes 3 microops+5 latency (=6 clocks)
Pentium 4 takes 4 microops+16 latency (=17 or 18 clocks)
Pentium 4E takes 3 microops+11 latency (=12 clocks)
AMD64 takes 2 ops+3 latency (upto 5 clocks)

So unless you're using P4, you can't get these huge numbers. And even if you're on a P4, you can use SSE already, which can pack and multiply and all other nifty stuff Smile
Post 09 Aug 2007, 07:32
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 09 Aug 2007, 16:21
Quote:
Erm, what kind of CPU you're using. My MULs take


Is this a question ? Question Question

Maybe it had been answered before asked ...

I wrote:

Quote:
Anyone knows a faster way (80386 code, no MMX/SSSSE) ?


Laughing

Quote:
So unless you're using P4, you can't get these huge numbers. And even if you're on a P4


Thanks for P4 and AMD64 optimization hints, OTOH I don't expect performance problems on those Laughing

And YES, Xorpd!'s code was very useful for me Smile
Post 09 Aug 2007, 16:21
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 09 Aug 2007, 20:34
Oops, didn't read too carefully the first post :S Sorry!
Post 09 Aug 2007, 20:34
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 10 Aug 2007, 10:46
Are you going to be running the code on a 80386 though, or do you just want non-mmx/sse code?
Post 10 Aug 2007, 10:46
View user's profile Send private message Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 10 Aug 2007, 19:25
Seems that you may not have yet timed this for 80386, because I think mul may have an early-out because the Turbo Assembler Quick Reference Guide lists 9-38 clock cycles for this instruction. I wasn't sure that the '386 had shld and shrd, but it does and they are fast! Given this datum, every other instruction on that processor must have been shrd or shld, when it wasn't lea, because this works around the 2-register ISA limitations. Obviously this requires a modification to my previous mul-free solution:
Code:
; Assume input in eax
   and eax, 44444444h
   lea eax, [eax+8*eax]
   and eax, 60606060h
   shld edx, eax, 29 ; Similar to previous but 3 hi bits of edx are don't cares
   lea eax, [edx+8*eax]
   shld edx, eax, 20 ; Now hi 15 bits are don't cares
   or eax, edx
; Output in ah
    

Almost makes it look like a 3-register ISA, doesn't it? I get 16 clocks from the above reference, but I don't know whether decoding any prefixes may slow things down. Clearly needs testing on the target processor to be sure.

The LUT solution could be speeded up if it were possible do use 16-bit addressing because the first movzx could be made unnecessary. The other movzx operations could be abbreviated to mov operations which seem to be faster on the '386 and the LUT could be chopped in half if the combining instruction were lea instead of or. It becomes a rather tempting possibility if the program is running in real mode and it takes extra time to decode all the prefixes in the iterative versions. Still it's a big LUT, which tends to look better in benchmarking than in real life. Please let us know what kind of times you are getting in your tests.
Post 10 Aug 2007, 19:25
View user's profile Send private message Visit poster's website Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 10 Aug 2007, 19:40
> Are you going to be running the code on a 80386 though, or do you just want non-mmx/sse code?

Seems to be a mostly theoretical question ... there is IIRC not much "bit magic" between 80386 and MMX ... BSWAP maybe Confused

And YES, I want my code to be 80386 compatible whenever possible Wink

> Guide lists 9-38 clock cycles for this instruction.

YES ... but from my understanding at least one of the operands would have to be ridiculously small ... and this isn't the case Confused

> I wasn't sure that the '386 had shld and shrd, but it does and they are fast!

Thanks. Will look at this code also Smile

> and the LUT could be chopped in half if the combining instruction were lea instead of or. It becomes a rather tempting possibility if the program is running in real mode

It's 32-bit PM. But the LUT conflicted somewhat with surrounding code so I focused on your solution Wink

> Please let us know what kind of times you are getting in your tests.

Your code is very good (speed-up > 2x) ... but I have no 80386 by now so the times given in Intel's manual are the point of interest ... ?
Post 10 Aug 2007, 19:40
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 10 Aug 2007, 22:30
NTOSKRNL_VXE wrote:
And YES, I want my code to be 80386 compatible whenever possible Wink


80386 compatbile, or run on 80386 hardware? There's a big difference, optimization-wise...

_________________
carpe noctem
Post 10 Aug 2007, 22:30
View user's profile Send private message Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 14 Aug 2007, 06:41
Quote:
80386 compatbile, or run on 80386 hardware? There's a big difference, optimization-wise...


Unreproductable. Confused

Xorpd! wrote:

> I wasn't sure that the '386 had shld and shrd, but it does and they are fast!

Very good ! A pity not to use them Laughing

Code:
; Input in EAX

   and  eax, $44444444   ; 2 1. Zeroize out the junk
   lea  eax, [9*eax]     ; 2 2.
   and  eax, $60606060   ; 2 3. ZERO'ize out duplicates
   shld edx, eax, 29     ; 3 4. EDX:=EAX SHR 3 similar to previous but 3 hi bits of EDX are junk now
   lea  eax, [edx+8*eax] ; 2 5.
   shld edx, eax, 20     ; 3 6. EDX:=EAX SHR 12 now total hi 15 bits of EDX are junk
   or   eax, edx         ; 2 7.

; Result in AH , 16 cycles , 2 registers trashed

; 0. A |J B7 J J | J B6 J J | J B5 J J | J B4 J J | J B3 J J | J B2 J J | J B1 J J | J B0 J J |
; 1. A |0 B7 0 0 | 0 B6 0 0 | 0 B5 0 0 | 0 B4 0 0 | 0 B3 0 0 | 0 B2 0 0 | 0 B1 0 0 | 0 B0 0 0 |
; 2. A |0 B7 B6 0| 0 B6 B5 0| 0 B5 B4 0| 0 B4 B3 0| 0 B3 B2 0| 0 B2 B1 0| 0 B1 B0 0| 0 B0 0 0 |
;    & |0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0 0 |
; 3. A |0 B7 B6 0| 0 0  0  0| 0 B5 B4 0| 0 0  0  0| 0 B3 B2 0| 0 0  0  0| 0 B1 B0 0| 0 0  0 0 |
; 4. A |J J J  0 | B7 B6 0 0| 0  0  0 0| B5 B4 0 0| 0  0  0 0| B3 B2 0 0| 0  0  0 0| B1 B0 0 0|
; 5. A |J J J  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0| B3B2 B1B0| 0  0  0 0| B1 B0 0 0|
; 6. D |J J J  J | J J J J  | J J J  J | J J J  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0|
;   +
; 7. A |J J J  J | J J J J  | J J J  J | J J J  B2| B7B6 B5B4| B3B2 B1B0| B5B4 B3B2| B1 B0 0 0|
;      |  junk   |   junk   |   junk   |   junk   |   ===   AH   ===    |   junk   |   junk   |
    


1.25 times faster again Smile


Last edited by DOS386 on 14 Aug 2007, 07:19; edited 1 time in total
Post 14 Aug 2007, 06:41
View user's profile Send private message Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 14 Aug 2007, 07:15
Am I really stupid, or is something wrong with that 12 Neutral - I think it MUST be 20!!!
...OR if you REALLY insist on twelve, then SHRD ..12...
Post 14 Aug 2007, 07:15
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 14 Aug 2007, 07:18
> think it MUST be 20!!!

Probably right Sad

_________________
Bug Nr.: 12345

Title: Hello World program compiles to 100 KB !!!

Status: Closed: NOT a Bug
Post 14 Aug 2007, 07:18
View user's profile Send private message Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 16 Aug 2007, 02:53
Impressive use of binary arithmetic !
Post 16 Aug 2007, 02:53
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 16 Aug 2007, 10:47
Yeah, impressive - I know how MUL 9 works: it packs in pairs of two, but I still don't know how to get the magic value of mov edx, 08208200h which pack pairs of two into 8 consecutive bits.

I hope there's a way to do it in just one multiplication which should be fast enough (that is when NOT on a 386 or earlier).

PS I remember I read somewhere that SHR(D) and SHL(D) have different timings. Something about shifting in one direction is more optimal. Maybe I'm mistaken because I couldn't find any hard evidence to my claim. Maybe one could try to replace SHRD reg1,reg2,imm with SHLD reg1,reg2,32-imm and vice versa.

PS2 Very Happy I know there isn't much difference between 4 and 1 clocks, but isn't MOVing around registers and SHifting, ROtating a tad faster than SHL/RD?
Post 16 Aug 2007, 10:47
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
DOS386



Joined: 08 Dec 2006
Posts: 1905
DOS386 16 Aug 2007, 20:59
> I read somewhere that SHR(D) and SHL(D) have different

Pointing to the "somewhere" would help Laughing

> know there isn't much difference between 4 and 1 clocks, but isn't MOVing around registers and SHifting, ROtating a tad faster than SHL/RD?

NOT according to my docs.
Post 16 Aug 2007, 20:59
View user's profile Send private message Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 17 Aug 2007, 20:46
Madis731 wrote:

Yeah, impressive - I know how MUL 9 works: it packs in pairs of two, but I still don't know how to get the magic value of mov edx, 08208200h which pack pairs of two into 8 consecutive bits.

Think about the positions of the bits we have and where we want them to be:
Code:
eax (input) | .. .. .. .. | .. .. .. .. |  0 B7 B6  0 |  0  0  0  0 |  0 B5 B4  0 |  0  0  0  0 |  0 B3 B2  0 |  0  0  0  0 |  0 B1 B0  0 |  0  0  0  0 |
dl (output) | B7 B6 B5 B4 | B3 B2 B1 B0 | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. |
bit #       | 39 38 37 36 | 35 34 33 32 | 31 30 29 28 | 27 26 25 24 | 23 22 21 20 | 19 18 17 16 | 15 14 13 12 | 11 10  9  8 |  7  6  5  4 |  3  2  1  0 |
    

To get B7 to the right position, we must shift it from bit #30 to bit #39. This is 9 bits, which is multiplication by 00000200h. B6 follows along, but B5 needs to go from bit #22 to bit #37 = 15 bits or multiplication by 00008000h. B4 is now OK, but we must shift B3 from bit #14 to bit #35 = 21 bits or multiplication by 00200000h. B2 OK; shift B1 from bit #6 to bit #33 = 27 bits or multiplication by 08000000h. This also sets up B0. After shifting we or the resulting pieces together:
Code:
   00000200h
   00008000h
   00200000h
or 08000000h
------------
   08208200h
    

where we have used the distributive law to combine the multiplicands into one magic multiplicand.

We have to check to make sure that the different pieces we are thinking about won't stomp on each other, but here it's easy to see that they don't: the B7 B6 piece ends up in bit positions congruent to 6 and 7 mod 8, B5 B4 at 4 and 5 mod 8, B3 B2 at 2 and 3 mod 8, and B1 B0 at 0 and 1 mod 8, so each piece along with the superfluous copies it makes lies in its own unique positions mod 8 in the output edx:eax so nobody overwrites good stuff with its superfluous bits nor causes a carry, so replacing the ors with adds is OK, too.

This is a trick one often looks for in buckets of bits code, see "Efficient Implementation of Population Count Function" in AMD's 22007.pdf.
Madis731 wrote:

I hope there's a way to do it in just one multiplication which should be fast enough (that is when NOT on a 386 or earlier).

You can see why I did the preliminary stuff: B7 and B6 started out 4 bits apart, so without the first round of consolidation I would have ended up with a copy of B6 at bit #35, interfering with the the B3 that I want to put there, among others.
Madis731 wrote:

PS I remember I read somewhere that SHR(D) and SHL(D) have different timings. Something about shifting in one direction is more optimal. Maybe I'm mistaken because I couldn't find any hard evidence to my claim. Maybe one could try to replace SHRD reg1,reg2,imm with SHLD reg1,reg2,32-imm and vice versa.

I have no such recollection, but bsf and bsr had different timings on Pentium Classic because they both used microcode but one used a linear loop and the other did a binary search. Also IIRC the microcode could rack up mispredicted branches...
Madis731 wrote:

PS2 I know there isn't much difference between 4 and 1 clocks, but isn't MOVing around registers and SHifting, ROtating a tad faster than SHL/RD?

On Core 2 Duo, for example, shld has 2 clocks latency but a mov and shr each have 1 clock latency and can execute in the same clock cycle, so the mov/shr combination can indeed be faster on some processors, just doesn't seem to be (without being able to measure; there may be some delay caused by decoding the 0f prefix on shld) on 80386.
Post 17 Aug 2007, 20:46
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.