flat assembler
Message board for the users of flat assembler.
![]() |
Author |
|
Xorpd! 08 Aug 2007, 06:35
;Assume input in eax.
and eax, 44444444h lea eax, [eax+8*eax] and eax 60606060h #if you have a fast multiplier mov edx, 08208200h mul edx ; output in dl #else lea edx, [8*eax] shr eax, 3 or edx, eax mov eax, edx shr edx, 12 or eax, edx ; output in ah |
|||
![]() |
|
Madis731 08 Aug 2007, 12:13
Code tags please and nicer spacing:
![]() Code: FASTMUL=1 mov eax,0F7264CD8h ;Your data in eax and eax, 44444444h lea eax, [9*eax] and eax, 60606060h if FASTMUL=1 mov edx,08208200h mul edx ; output in dl else lea edx, [8*eax] shr eax, 3 or edx, eax mov eax, edx shr edx, 12 or eax, edx ; output in ah end if @Xorpd! Btw, how did you come up with that magic value? mov edx,08208200h |
|||
![]() |
|
DOS386 08 Aug 2007, 12:36
Thanks.
Code: #if you have a fast multiplier mov edx, 08208200h mul edx ; output in dl MUL costs 38 $ ... oops 38 cycles (!!!) ![]() ![]() ![]() ![]() OTOH the no-MUL code looks promising ... should be almost 2x faster ... if it gives good results ![]() Code: ;input in EAX and eax, $44444444 ; 1. Zeroize out the junk lea eax, [9*eax] ; 2. * 9 || SHL 3 followed by OR, drifted by 3 bits here and eax, $60606060 ; 3. Get rid of unuseful duplicates, $F0F0F0F0 would do also lea edx, [8*eax] ; 4. EAX * 8 here || SHL 3 shr eax, 3 ; 5. EAX / 8 here || SHR 3 or edx, eax ; 6. Put together, drifted by 6 bits mov eax, edx ; 7. EAX<-EDX shr edx, 12 ; 8. Divide by 4096 || SHR 12 or eax, edx ; 9. Put together, drifted by 12 bits here ;result in AH 0. A |J B7 J J | J B6 J J | J B5 J J | J B4 J J | J B3 J J | J B2 J J | J B1 J J | J B0 J J | 1. A |0 B7 0 0 | 0 B6 0 0 | 0 B5 0 0 | 0 B4 0 0 | 0 B3 0 0 | 0 B2 0 0 | 0 B1 0 0 | 0 B0 0 0 | 2. A |0 B7 B6 0| 0 B6 B5 0| 0 B5 B4 0| 0 B4 B3 0| 0 B3 B2 0| 0 B2 B1 0| 0 B1 B0 0| 0 B0 0 0 | & |0 1 1 0| 0 0 0 0| 0 1 1 0| 0 0 0 0| 0 1 1 0| 0 0 0 0| 0 1 1 0| 0 0 0 0 | 3. A |0 B7 B6 0| 0 0 0 0| 0 B5 B4 0| 0 0 0 0| 0 B3 B2 0| 0 0 0 0| 0 B1 B0 0| 0 0 0 0 | 4. D |0 0 0 0 | 0 0 B5 B4| 0 0 0 0 | 0 0 B3 B2| 0 0 0 0 | 0 0 B1 B0| 0 0 0 0 | 0 0 0 0 | 5. A |0 0 0 0 | B7 B6 0 0| 0 0 0 0| B5 B4 0 0| 0 0 0 0| B3 B2 0 0| 0 0 0 0| B1 B0 0 0| + 6. D |0 0 0 0 | B7B6 B5B4| 0 0 0 0| B5B4 B3B2| 0 0 0 0| B3B2 B1B0| 0 0 0 0| B1 B0 0 0| 7. A |0 0 0 0 | B7B6 B5B4| 0 0 0 0| B5B4 B3B2| 0 0 0 0| B3B2 B1B0| 0 0 0 0| B1 B0 0 0| 8. D |0 0 0 0 |0 0 0 0 | 0 0 0 0 | 0 0 0 0 | B7B6 B5B4| 0 0 0 0| B5B4 B3B2| 0 0 0 0| + 9. A |0 0 0 0 | B7B6 B5B4| 0 0 0 0 | B5B4 B3B2| B7B6 B5B4| B3B2 B1B0| B5B4 B3B2| B1 B0 0 0| | junk | junk | junk | junk | === AH === | junk | junk | Seems to perfectly work in my "simulation" ![]() 20 cycles, 2 of 32-bit registers trashed Other ideas still welcome ![]() Last edited by DOS386 on 14 Aug 2007, 06:34; edited 3 times in total |
|||
![]() |
|
r22 08 Aug 2007, 13:27
If memory isn't an issue you may want to try a lookup table
Fill the LUTLO and LUTHI with the correct returned values for every possible WORD sized bit stream and your all set. It will use 65536 * 2 bytes though. movzx edx,ax movzx ecx,byte[LUTLO+edx] shr eax,16 movzx edx,byte[LUTHI+eax] or ecx,edx |
|||
![]() |
|
DOS386 08 Aug 2007, 13:38
> If memory isn't an issue you may want to try a lookup table
> It will use 65536 * 2 bytes though. Costs 128 KiB ? No problem ... if it speeds up ![]() Code: ; input in EAX movzx edx,ax ; Separate 16 lower bits movzx ecx,byte[LUTLO+edx] shr eax,16 ; Separate 16 higher bits movzx edx,byte[LUTHI+eax] or ecx,edx ; result in CL 20 cycles, 3 32-bit registers trashed Will test ![]() |
|||
![]() |
|
Madis731 09 Aug 2007, 07:32
MUL costs 38 $ ... oops 38 cycles (!!!) Shocked ...
Erm, what kind of CPU you're using. My MULs take 3 fused microops+5 latency which is 6 clocks at max (Core 2). Pentium and Pentium MMX take 9 nonpairable (full) clocks. Penritum Pro/II/III take 1 clock+4 latency which is 5 clocks at max. Pentium Mobility takes 3 microops+5 latency (=6 clocks) Pentium 4 takes 4 microops+16 latency (=17 or 18 clocks) Pentium 4E takes 3 microops+11 latency (=12 clocks) AMD64 takes 2 ops+3 latency (upto 5 clocks) So unless you're using P4, you can't get these huge numbers. And even if you're on a P4, you can use SSE already, which can pack and multiply and all other nifty stuff ![]() |
|||
![]() |
|
DOS386 09 Aug 2007, 16:21
Quote: Erm, what kind of CPU you're using. My MULs take Is this a question ? ![]() ![]() Maybe it had been answered before asked ... I wrote: Quote: Anyone knows a faster way (80386 code, no MMX/SSSSE) ? ![]() Quote: So unless you're using P4, you can't get these huge numbers. And even if you're on a P4 Thanks for P4 and AMD64 optimization hints, OTOH I don't expect performance problems on those ![]() And YES, Xorpd!'s code was very useful for me ![]() |
|||
![]() |
|
Madis731 09 Aug 2007, 20:34
Oops, didn't read too carefully the first post :S Sorry!
|
|||
![]() |
|
f0dder 10 Aug 2007, 10:46
Are you going to be running the code on a 80386 though, or do you just want non-mmx/sse code?
|
|||
![]() |
|
Xorpd! 10 Aug 2007, 19:25
Seems that you may not have yet timed this for 80386, because I think mul may have an early-out because the Turbo Assembler Quick Reference Guide lists 9-38 clock cycles for this instruction. I wasn't sure that the '386 had shld and shrd, but it does and they are fast! Given this datum, every other instruction on that processor must have been shrd or shld, when it wasn't lea, because this works around the 2-register ISA limitations. Obviously this requires a modification to my previous mul-free solution:
Code: ; Assume input in eax and eax, 44444444h lea eax, [eax+8*eax] and eax, 60606060h shld edx, eax, 29 ; Similar to previous but 3 hi bits of edx are don't cares lea eax, [edx+8*eax] shld edx, eax, 20 ; Now hi 15 bits are don't cares or eax, edx ; Output in ah Almost makes it look like a 3-register ISA, doesn't it? I get 16 clocks from the above reference, but I don't know whether decoding any prefixes may slow things down. Clearly needs testing on the target processor to be sure. The LUT solution could be speeded up if it were possible do use 16-bit addressing because the first movzx could be made unnecessary. The other movzx operations could be abbreviated to mov operations which seem to be faster on the '386 and the LUT could be chopped in half if the combining instruction were lea instead of or. It becomes a rather tempting possibility if the program is running in real mode and it takes extra time to decode all the prefixes in the iterative versions. Still it's a big LUT, which tends to look better in benchmarking than in real life. Please let us know what kind of times you are getting in your tests. |
|||
![]() |
|
DOS386 10 Aug 2007, 19:40
> Are you going to be running the code on a 80386 though, or do you just want non-mmx/sse code?
Seems to be a mostly theoretical question ... there is IIRC not much "bit magic" between 80386 and MMX ... BSWAP maybe ![]() And YES, I want my code to be 80386 compatible whenever possible ![]() > Guide lists 9-38 clock cycles for this instruction. YES ... but from my understanding at least one of the operands would have to be ridiculously small ... and this isn't the case ![]() > I wasn't sure that the '386 had shld and shrd, but it does and they are fast! Thanks. Will look at this code also ![]() > and the LUT could be chopped in half if the combining instruction were lea instead of or. It becomes a rather tempting possibility if the program is running in real mode It's 32-bit PM. But the LUT conflicted somewhat with surrounding code so I focused on your solution ![]() > Please let us know what kind of times you are getting in your tests. Your code is very good (speed-up > 2x) ... but I have no 80386 by now so the times given in Intel's manual are the point of interest ... ? |
|||
![]() |
|
f0dder 10 Aug 2007, 22:30
NTOSKRNL_VXE wrote: And YES, I want my code to be 80386 compatible whenever possible 80386 compatbile, or run on 80386 hardware? There's a big difference, optimization-wise... _________________ carpe noctem |
|||
![]() |
|
DOS386 14 Aug 2007, 06:41
Quote: 80386 compatbile, or run on 80386 hardware? There's a big difference, optimization-wise... Unreproductable. ![]() Xorpd! wrote: > I wasn't sure that the '386 had shld and shrd, but it does and they are fast! Very good ! A pity not to use them ![]() Code: ; Input in EAX and eax, $44444444 ; 2 1. Zeroize out the junk lea eax, [9*eax] ; 2 2. and eax, $60606060 ; 2 3. ZERO'ize out duplicates shld edx, eax, 29 ; 3 4. EDX:=EAX SHR 3 similar to previous but 3 hi bits of EDX are junk now lea eax, [edx+8*eax] ; 2 5. shld edx, eax, 20 ; 3 6. EDX:=EAX SHR 12 now total hi 15 bits of EDX are junk or eax, edx ; 2 7. ; Result in AH , 16 cycles , 2 registers trashed ; 0. A |J B7 J J | J B6 J J | J B5 J J | J B4 J J | J B3 J J | J B2 J J | J B1 J J | J B0 J J | ; 1. A |0 B7 0 0 | 0 B6 0 0 | 0 B5 0 0 | 0 B4 0 0 | 0 B3 0 0 | 0 B2 0 0 | 0 B1 0 0 | 0 B0 0 0 | ; 2. A |0 B7 B6 0| 0 B6 B5 0| 0 B5 B4 0| 0 B4 B3 0| 0 B3 B2 0| 0 B2 B1 0| 0 B1 B0 0| 0 B0 0 0 | ; & |0 1 1 0| 0 0 0 0| 0 1 1 0| 0 0 0 0| 0 1 1 0| 0 0 0 0| 0 1 1 0| 0 0 0 0 | ; 3. A |0 B7 B6 0| 0 0 0 0| 0 B5 B4 0| 0 0 0 0| 0 B3 B2 0| 0 0 0 0| 0 B1 B0 0| 0 0 0 0 | ; 4. A |J J J 0 | B7 B6 0 0| 0 0 0 0| B5 B4 0 0| 0 0 0 0| B3 B2 0 0| 0 0 0 0| B1 B0 0 0| ; 5. A |J J J 0 | B7B6 B5B4| 0 0 0 0| B5B4 B3B2| 0 0 0 0| B3B2 B1B0| 0 0 0 0| B1 B0 0 0| ; 6. D |J J J J | J J J J | J J J J | J J J 0 | B7B6 B5B4| 0 0 0 0| B5B4 B3B2| 0 0 0 0| ; + ; 7. A |J J J J | J J J J | J J J J | J J J B2| B7B6 B5B4| B3B2 B1B0| B5B4 B3B2| B1 B0 0 0| ; | junk | junk | junk | junk | === AH === | junk | junk | 1.25 times faster again ![]() Last edited by DOS386 on 14 Aug 2007, 07:19; edited 1 time in total |
|||
![]() |
|
Madis731 14 Aug 2007, 07:15
Am I really stupid, or is something wrong with that 12
![]() ...OR if you REALLY insist on twelve, then SHRD ..12... |
|||
![]() |
|
DOS386 14 Aug 2007, 07:18
> think it MUST be 20!!!
Probably right ![]() _________________ Bug Nr.: 12345 Title: Hello World program compiles to 100 KB !!! Status: Closed: NOT a Bug |
|||
![]() |
|
r22 16 Aug 2007, 02:53
Impressive use of binary arithmetic !
|
|||
![]() |
|
Madis731 16 Aug 2007, 10:47
Yeah, impressive - I know how MUL 9 works: it packs in pairs of two, but I still don't know how to get the magic value of mov edx, 08208200h which pack pairs of two into 8 consecutive bits.
I hope there's a way to do it in just one multiplication which should be fast enough (that is when NOT on a 386 or earlier). PS I remember I read somewhere that SHR(D) and SHL(D) have different timings. Something about shifting in one direction is more optimal. Maybe I'm mistaken because I couldn't find any hard evidence to my claim. Maybe one could try to replace SHRD reg1,reg2,imm with SHLD reg1,reg2,32-imm and vice versa. PS2 ![]() |
|||
![]() |
|
DOS386 16 Aug 2007, 20:59
> I read somewhere that SHR(D) and SHL(D) have different
Pointing to the "somewhere" would help ![]() > know there isn't much difference between 4 and 1 clocks, but isn't MOVing around registers and SHifting, ROtating a tad faster than SHL/RD? NOT according to my docs. |
|||
![]() |
|
Xorpd! 17 Aug 2007, 20:46
Madis731 wrote:
Think about the positions of the bits we have and where we want them to be: Code: eax (input) | .. .. .. .. | .. .. .. .. | 0 B7 B6 0 | 0 0 0 0 | 0 B5 B4 0 | 0 0 0 0 | 0 B3 B2 0 | 0 0 0 0 | 0 B1 B0 0 | 0 0 0 0 | dl (output) | B7 B6 B5 B4 | B3 B2 B1 B0 | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | .. .. .. .. | bit # | 39 38 37 36 | 35 34 33 32 | 31 30 29 28 | 27 26 25 24 | 23 22 21 20 | 19 18 17 16 | 15 14 13 12 | 11 10 9 8 | 7 6 5 4 | 3 2 1 0 | To get B7 to the right position, we must shift it from bit #30 to bit #39. This is 9 bits, which is multiplication by 00000200h. B6 follows along, but B5 needs to go from bit #22 to bit #37 = 15 bits or multiplication by 00008000h. B4 is now OK, but we must shift B3 from bit #14 to bit #35 = 21 bits or multiplication by 00200000h. B2 OK; shift B1 from bit #6 to bit #33 = 27 bits or multiplication by 08000000h. This also sets up B0. After shifting we or the resulting pieces together: Code: 00000200h 00008000h 00200000h or 08000000h ------------ 08208200h where we have used the distributive law to combine the multiplicands into one magic multiplicand. We have to check to make sure that the different pieces we are thinking about won't stomp on each other, but here it's easy to see that they don't: the B7 B6 piece ends up in bit positions congruent to 6 and 7 mod 8, B5 B4 at 4 and 5 mod 8, B3 B2 at 2 and 3 mod 8, and B1 B0 at 0 and 1 mod 8, so each piece along with the superfluous copies it makes lies in its own unique positions mod 8 in the output edx:eax so nobody overwrites good stuff with its superfluous bits nor causes a carry, so replacing the ors with adds is OK, too. This is a trick one often looks for in buckets of bits code, see "Efficient Implementation of Population Count Function" in AMD's 22007.pdf. Madis731 wrote:
You can see why I did the preliminary stuff: B7 and B6 started out 4 bits apart, so without the first round of consolidation I would have ended up with a copy of B6 at bit #35, interfering with the the B3 that I want to put there, among others. Madis731 wrote:
I have no such recollection, but bsf and bsr had different timings on Pentium Classic because they both used microcode but one used a linear loop and the other did a binary search. Also IIRC the microcode could rack up mispredicted branches... Madis731 wrote:
On Core 2 Duo, for example, shld has 2 clocks latency but a mov and shr each have 1 clock latency and can execute in the same clock cycle, so the mov/shr combination can indeed be faster on some processors, just doesn't seem to be (without being able to measure; there may be some delay caused by decoding the 0f prefix on shld) on 80386. |
|||
![]() |
|
< Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.