flat assembler
Message board for the users of flat assembler.

 Index > Main > Optimization challenge (big (?) deal with small bits ...)
Author
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
Following problem:

Have an UINT32 containing 8 interesting bits only:

>> JJ B7 JJ JJ | JJ B6 JJ JJ | JJ B5 JJ JJ | JJ B4 JJ JJ | JJ B3 JJ JJ | JJ B2 JJ JJ | JJ B1 JJ JJ | JJ B0 JJ JJ <<

And want to extract them and put in one byte:

>> B7 B6 B5 B4 B3 B2 B1 B0 <<

What is the fastest way to do ? I can move every single bit separately, the solution results in 16 instructions. Anyone knows a faster way (80386 code, no MMX/SSSSE) ?

_________________
Bug Nr.: 12345

Title: Hello World program compiles to 100 KB !!!

Status: Closed: NOT a Bug
08 Aug 2007, 05:39
Xorpd!

Joined: 21 Dec 2006
Posts: 161
Xorpd!
;Assume input in eax.
and eax, 44444444h
lea eax, [eax+8*eax]
and eax 60606060h
#if you have a fast multiplier
mov edx, 08208200h
mul edx ; output in dl
#else
lea edx, [8*eax]
shr eax, 3
or edx, eax
mov eax, edx
shr edx, 12
or eax, edx ; output in ah
08 Aug 2007, 06:35

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Code tags please and nicer spacing:
Code:
```FASTMUL=1
mov eax,0F7264CD8h ;Your data in eax

and eax, 44444444h
lea eax, [9*eax]
and eax, 60606060h
if FASTMUL=1
mov edx,08208200h
mul edx ; output in dl
else
lea edx, [8*eax]
shr eax, 3
or  edx, eax
mov eax, edx
shr edx, 12
or  eax, edx ; output in ah
end if
```

@Xorpd! Btw, how did you come up with that magic value? mov edx,08208200h
08 Aug 2007, 12:13
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
Thanks.

Code:
```#if you have a fast multiplier
mov edx, 08208200h
mul edx ; output in dl
```

MUL costs 38 \$ ... oops 38 cycles (!!!) ... forget it while "ordinary" instructions (incl. sophisticated LEA (??) ) cost only 2...4 - MUL would definitely pessimise my code

OTOH the no-MUL code looks promising ... should be almost 2x faster ... if it gives good results I'll test ...

Code:
```   ;input in EAX
and eax, \$44444444  ; 1. Zeroize out the junk
lea eax, [9*eax]    ; 2. * 9 || SHL 3 followed by OR, drifted by 3 bits here
and eax, \$60606060  ; 3. Get rid of unuseful duplicates, \$F0F0F0F0 would do also
lea edx, [8*eax]    ; 4. EAX * 8 here || SHL 3
shr eax, 3          ; 5. EAX / 8 here  || SHR 3
or  edx, eax        ; 6. Put together, drifted by 6 bits
mov eax, edx        ; 7. EAX<-EDX
shr edx, 12         ; 8. Divide by 4096 || SHR 12
or  eax, edx        ; 9. Put together, drifted by 12 bits here
;result in AH

0. A |J B7 J J | J B6 J J | J B5 J J | J B4 J J | J B3 J J | J B2 J J | J B1 J J | J B0 J J |
1. A |0 B7 0 0 | 0 B6 0 0 | 0 B5 0 0 | 0 B4 0 0 | 0 B3 0 0 | 0 B2 0 0 | 0 B1 0 0 | 0 B0 0 0 |
2. A |0 B7 B6 0| 0 B6 B5 0| 0 B5 B4 0| 0 B4 B3 0| 0 B3 B2 0| 0 B2 B1 0| 0 B1 B0 0| 0 B0 0 0 |
& |0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0 0 |
3. A |0 B7 B6 0| 0 0  0  0| 0 B5 B4 0| 0 0  0  0| 0 B3 B2 0| 0 0  0  0| 0 B1 B0 0| 0 0  0 0 |
4. D |0 0 0  0 | 0 0 B5 B4| 0 0 0  0 | 0 0 B3 B2| 0 0 0  0 | 0 0 B1 B0| 0 0 0  0 | 0 0  0 0 |
5. A |0 0 0  0 | B7 B6 0 0| 0  0  0 0| B5 B4 0 0| 0  0  0 0| B3 B2 0 0| 0  0  0 0| B1 B0 0 0|
+
6. D |0 0 0  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0| B3B2 B1B0| 0  0  0 0| B1 B0 0 0|
7. A |0 0 0  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0| B3B2 B1B0| 0  0  0 0| B1 B0 0 0|
8. D |0 0 0  0 |0 0 0  0  | 0 0 0  0 | 0 0 0  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0|
+
9. A |0 0 0  0 | B7B6 B5B4| 0  0 0 0 | B5B4 B3B2| B7B6 B5B4| B3B2 B1B0| B5B4 B3B2| B1 B0 0 0|
|  junk   |   junk   |   junk   |   junk   |   ===   AH   ===    |   junk   |  junk   |
```

Seems to perfectly work in my "simulation"

20 cycles, 2 of 32-bit registers trashed

Other ideas still welcome

Last edited by DOS386 on 14 Aug 2007, 06:34; edited 3 times in total
08 Aug 2007, 12:36
r22

Joined: 27 Dec 2004
Posts: 805
r22
If memory isn't an issue you may want to try a lookup table
Fill the LUTLO and LUTHI with the correct returned values for every possible WORD sized bit stream and your all set. It will use 65536 * 2 bytes though.

movzx edx,ax
movzx ecx,byte[LUTLO+edx]
shr eax,16
movzx edx,byte[LUTHI+eax]
or ecx,edx
08 Aug 2007, 13:27
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
> If memory isn't an issue you may want to try a lookup table
> It will use 65536 * 2 bytes though.

Costs 128 KiB ? No problem ... if it speeds up

Code:
```  ; input in EAX
movzx edx,ax  ; Separate 16 lower bits
movzx ecx,byte[LUTLO+edx]
shr eax,16    ; Separate 16 higher bits
movzx edx,byte[LUTHI+eax]
or ecx,edx
; result in CL
```

20 cycles, 3 32-bit registers trashed

Will test
08 Aug 2007, 13:38

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
MUL costs 38 \$ ... oops 38 cycles (!!!) Shocked ...

Erm, what kind of CPU you're using. My MULs take 3 fused microops+5 latency which is 6 clocks at max (Core 2).

Pentium and Pentium MMX take 9 nonpairable (full) clocks.
Penritum Pro/II/III take 1 clock+4 latency which is 5 clocks at max.
Pentium Mobility takes 3 microops+5 latency (=6 clocks)
Pentium 4 takes 4 microops+16 latency (=17 or 18 clocks)
Pentium 4E takes 3 microops+11 latency (=12 clocks)
AMD64 takes 2 ops+3 latency (upto 5 clocks)

So unless you're using P4, you can't get these huge numbers. And even if you're on a P4, you can use SSE already, which can pack and multiply and all other nifty stuff
09 Aug 2007, 07:32
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
Quote:
Erm, what kind of CPU you're using. My MULs take

Is this a question ?

I wrote:

Quote:
Anyone knows a faster way (80386 code, no MMX/SSSSE) ?

Quote:
So unless you're using P4, you can't get these huge numbers. And even if you're on a P4

Thanks for P4 and AMD64 optimization hints, OTOH I don't expect performance problems on those

And YES, Xorpd!'s code was very useful for me
09 Aug 2007, 16:21

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Oops, didn't read too carefully the first post :S Sorry!
09 Aug 2007, 20:34
f0dder

Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Are you going to be running the code on a 80386 though, or do you just want non-mmx/sse code?
10 Aug 2007, 10:46
Xorpd!

Joined: 21 Dec 2006
Posts: 161
Xorpd!
Seems that you may not have yet timed this for 80386, because I think mul may have an early-out because the Turbo Assembler Quick Reference Guide lists 9-38 clock cycles for this instruction. I wasn't sure that the '386 had shld and shrd, but it does and they are fast! Given this datum, every other instruction on that processor must have been shrd or shld, when it wasn't lea, because this works around the 2-register ISA limitations. Obviously this requires a modification to my previous mul-free solution:
Code:
```; Assume input in eax
and eax, 44444444h
lea eax, [eax+8*eax]
and eax, 60606060h
shld edx, eax, 29 ; Similar to previous but 3 hi bits of edx are don't cares
lea eax, [edx+8*eax]
shld edx, eax, 20 ; Now hi 15 bits are don't cares
or eax, edx
; Output in ah
```

Almost makes it look like a 3-register ISA, doesn't it? I get 16 clocks from the above reference, but I don't know whether decoding any prefixes may slow things down. Clearly needs testing on the target processor to be sure.

The LUT solution could be speeded up if it were possible do use 16-bit addressing because the first movzx could be made unnecessary. The other movzx operations could be abbreviated to mov operations which seem to be faster on the '386 and the LUT could be chopped in half if the combining instruction were lea instead of or. It becomes a rather tempting possibility if the program is running in real mode and it takes extra time to decode all the prefixes in the iterative versions. Still it's a big LUT, which tends to look better in benchmarking than in real life. Please let us know what kind of times you are getting in your tests.
10 Aug 2007, 19:25
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
> Are you going to be running the code on a 80386 though, or do you just want non-mmx/sse code?

Seems to be a mostly theoretical question ... there is IIRC not much "bit magic" between 80386 and MMX ... BSWAP maybe

And YES, I want my code to be 80386 compatible whenever possible

> Guide lists 9-38 clock cycles for this instruction.

YES ... but from my understanding at least one of the operands would have to be ridiculously small ... and this isn't the case

> I wasn't sure that the '386 had shld and shrd, but it does and they are fast!

Thanks. Will look at this code also

> and the LUT could be chopped in half if the combining instruction were lea instead of or. It becomes a rather tempting possibility if the program is running in real mode

It's 32-bit PM. But the LUT conflicted somewhat with surrounding code so I focused on your solution

> Please let us know what kind of times you are getting in your tests.

Your code is very good (speed-up > 2x) ... but I have no 80386 by now so the times given in Intel's manual are the point of interest ... ?
10 Aug 2007, 19:40
f0dder

Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
NTOSKRNL_VXE wrote:
And YES, I want my code to be 80386 compatible whenever possible

80386 compatbile, or run on 80386 hardware? There's a big difference, optimization-wise...

_________________
- carpe noctem
10 Aug 2007, 22:30
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
Quote:
80386 compatbile, or run on 80386 hardware? There's a big difference, optimization-wise...

Unreproductable.

Xorpd! wrote:

> I wasn't sure that the '386 had shld and shrd, but it does and they are fast!

Very good ! A pity not to use them

Code:
```; Input in EAX

and  eax, \$44444444   ; 2 1. Zeroize out the junk
lea  eax, [9*eax]     ; 2 2.
and  eax, \$60606060   ; 2 3. ZERO'ize out duplicates
shld edx, eax, 29     ; 3 4. EDX:=EAX SHR 3 similar to previous but 3 hi bits of EDX are junk now
lea  eax, [edx+8*eax] ; 2 5.
shld edx, eax, 20     ; 3 6. EDX:=EAX SHR 12 now total hi 15 bits of EDX are junk
or   eax, edx         ; 2 7.

; Result in AH , 16 cycles , 2 registers trashed

; 0. A |J B7 J J | J B6 J J | J B5 J J | J B4 J J | J B3 J J | J B2 J J | J B1 J J | J B0 J J |
; 1. A |0 B7 0 0 | 0 B6 0 0 | 0 B5 0 0 | 0 B4 0 0 | 0 B3 0 0 | 0 B2 0 0 | 0 B1 0 0 | 0 B0 0 0 |
; 2. A |0 B7 B6 0| 0 B6 B5 0| 0 B5 B4 0| 0 B4 B3 0| 0 B3 B2 0| 0 B2 B1 0| 0 B1 B0 0| 0 B0 0 0 |
;    & |0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0  0| 0 1  1  0| 0 0  0 0 |
; 3. A |0 B7 B6 0| 0 0  0  0| 0 B5 B4 0| 0 0  0  0| 0 B3 B2 0| 0 0  0  0| 0 B1 B0 0| 0 0  0 0 |
; 4. A |J J J  0 | B7 B6 0 0| 0  0  0 0| B5 B4 0 0| 0  0  0 0| B3 B2 0 0| 0  0  0 0| B1 B0 0 0|
; 5. A |J J J  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0| B3B2 B1B0| 0  0  0 0| B1 B0 0 0|
; 6. D |J J J  J | J J J J  | J J J  J | J J J  0 | B7B6 B5B4| 0  0  0 0| B5B4 B3B2| 0  0  0 0|
;   +
; 7. A |J J J  J | J J J J  | J J J  J | J J J  B2| B7B6 B5B4| B3B2 B1B0| B5B4 B3B2| B1 B0 0 0|
;      |  junk   |   junk   |   junk   |   junk   |   ===   AH   ===    |   junk   |   junk   |
```

1.25 times faster again

Last edited by DOS386 on 14 Aug 2007, 07:19; edited 1 time in total
14 Aug 2007, 06:41

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Am I really stupid, or is something wrong with that 12 - I think it MUST be 20!!!
...OR if you REALLY insist on twelve, then SHRD ..12...
14 Aug 2007, 07:15
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
> think it MUST be 20!!!

Probably right

_________________
Bug Nr.: 12345

Title: Hello World program compiles to 100 KB !!!

Status: Closed: NOT a Bug
14 Aug 2007, 07:18
r22

Joined: 27 Dec 2004
Posts: 805
r22
Impressive use of binary arithmetic !
16 Aug 2007, 02:53

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Yeah, impressive - I know how MUL 9 works: it packs in pairs of two, but I still don't know how to get the magic value of mov edx, 08208200h which pack pairs of two into 8 consecutive bits.

I hope there's a way to do it in just one multiplication which should be fast enough (that is when NOT on a 386 or earlier).

PS I remember I read somewhere that SHR(D) and SHL(D) have different timings. Something about shifting in one direction is more optimal. Maybe I'm mistaken because I couldn't find any hard evidence to my claim. Maybe one could try to replace SHRD reg1,reg2,imm with SHLD reg1,reg2,32-imm and vice versa.

PS2 I know there isn't much difference between 4 and 1 clocks, but isn't MOVing around registers and SHifting, ROtating a tad faster than SHL/RD?
16 Aug 2007, 10:47
DOS386

Joined: 08 Dec 2006
Posts: 1898
DOS386
> I read somewhere that SHR(D) and SHL(D) have different

Pointing to the "somewhere" would help

> know there isn't much difference between 4 and 1 clocks, but isn't MOVing around registers and SHifting, ROtating a tad faster than SHL/RD?

NOT according to my docs.
16 Aug 2007, 20:59
Xorpd!

Joined: 21 Dec 2006
Posts: 161
Xorpd!

Yeah, impressive - I know how MUL 9 works: it packs in pairs of two, but I still don't know how to get the magic value of mov edx, 08208200h which pack pairs of two into 8 consecutive bits.

Think about the positions of the bits we have and where we want them to be:
Code:
```eax (input) | .. .. .. .. | .. .. .. .. |  0 B7 B6  0 |  0  0  0  0 |  0 B5 B4  0 |  0  0  0  0 |  0 B3 B2  0 |  0  0  0  0 |  0 B1 B0  0 |  0  0  0  0 |
dl (output) | B7 B6 B5 B4 | B3 B2 B1 B0 | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. |
bit #       | 39 38 37 36 | 35 34 33 32 | 31 30 29 28 | 27 26 25 24 | 23 22 21 20 | 19 18 17 16 | 15 14 13 12 | 11 10  9  8 |  7  6  5  4 |  3  2  1  0 |
```

To get B7 to the right position, we must shift it from bit #30 to bit #39. This is 9 bits, which is multiplication by 00000200h. B6 follows along, but B5 needs to go from bit #22 to bit #37 = 15 bits or multiplication by 00008000h. B4 is now OK, but we must shift B3 from bit #14 to bit #35 = 21 bits or multiplication by 00200000h. B2 OK; shift B1 from bit #6 to bit #33 = 27 bits or multiplication by 08000000h. This also sets up B0. After shifting we or the resulting pieces together:
Code:
```   00000200h
00008000h
00200000h
or 08000000h
------------
08208200h
```

where we have used the distributive law to combine the multiplicands into one magic multiplicand.

We have to check to make sure that the different pieces we are thinking about won't stomp on each other, but here it's easy to see that they don't: the B7 B6 piece ends up in bit positions congruent to 6 and 7 mod 8, B5 B4 at 4 and 5 mod 8, B3 B2 at 2 and 3 mod 8, and B1 B0 at 0 and 1 mod 8, so each piece along with the superfluous copies it makes lies in its own unique positions mod 8 in the output edx:eax so nobody overwrites good stuff with its superfluous bits nor causes a carry, so replacing the ors with adds is OK, too.

This is a trick one often looks for in buckets of bits code, see "Efficient Implementation of Population Count Function" in AMD's 22007.pdf.

I hope there's a way to do it in just one multiplication which should be fast enough (that is when NOT on a 386 or earlier).

You can see why I did the preliminary stuff: B7 and B6 started out 4 bits apart, so without the first round of consolidation I would have ended up with a copy of B6 at bit #35, interfering with the the B3 that I want to put there, among others.

PS I remember I read somewhere that SHR(D) and SHL(D) have different timings. Something about shifting in one direction is more optimal. Maybe I'm mistaken because I couldn't find any hard evidence to my claim. Maybe one could try to replace SHRD reg1,reg2,imm with SHLD reg1,reg2,32-imm and vice versa.

I have no such recollection, but bsf and bsr had different timings on Pentium Classic because they both used microcode but one used a linear loop and the other did a binary search. Also IIRC the microcode could rack up mispredicted branches...

PS2 I know there isn't much difference between 4 and 1 clocks, but isn't MOVing around registers and SHifting, ROtating a tad faster than SHL/RD?

On Core 2 Duo, for example, shld has 2 clocks latency but a mov and shr each have 1 clock latency and can execute in the same clock cycle, so the mov/shr combination can indeed be faster on some processors, just doesn't seem to be (without being able to measure; there may be some delay caused by decoding the 0f prefix on shld) on 80386.
17 Aug 2007, 20:46
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum