flat assembler
Message board for the users of flat assembler.
 Home   FAQ   Search   Register Profile   Log in to check your private messages   Log in
 flat assembler > Main > Another one of these algos, this time: reverse_bits_32() Goto page 1, 2  Next
Author
 Thread
Madis731

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia

# Another one of these algos, this time: reverse_bits_32()

 Code: xor     eax,eax         ;p01   T=1         mov     ebx,12345678h   ;p01                    ;The constant that needs to be reversed         mov     ecx,ebx         ;p01   T=2         mov     edx,ebx         ;p01         and     ecx,0F0F0F0F0h  ;p01   T=3         and     edx,00F0F0F0Fh  ;p01         shr     ecx,4           ;p0    T=4         or      eax,ecx         ;p01         shl     edx,4           ;p0    T=5         or      eax,edx         ;p01         mov     ebx,eax         ;p01   T=6         mov     ecx,ebx         ;p01         mov     edx,ebx         ;p01   T=7         and     ecx,0CCCCCCCCh  ;p01         and     edx,033333333h  ;p01   T=8         shr     ecx,2           ;p0         mov     eax,ecx         ;p01   T=9         shl     edx,2           ;p0         or      eax,edx         ;p01   T=10         mov     ebx,eax         ;p01         mov     ecx,ebx         ;p01   T=11         mov     edx,ebx         ;p01         and     ecx,0AAAAAAAAh  ;p01   T=12         and     edx,055555555h  ;p01         shr     ecx,1           ;p0    T=13         mov     eax,ecx         ;p01         shl     edx,1           ;p0    T=14         or      eax,edx         ;p01         bswap   eax             ;p0+01 T=15-16                                 ;117 clocks on a P!!! <= WHY?                                 ;7-8 clocks on a P4 (PNI) <= This is what I expected from P!!!

_________________
My updated idol http://www.agner.org/optimize/
15 Mar 2006, 07:50
r22

Joined: 27 Dec 2004
Posts: 805
Wouldn't running each byte through a lookup table be a little faster?.
 Code: LUT: db 0 repeat 255 db ((% and 1) shl 7) or ((% and 2) shl 5) or ((% and 4) shl 3) or ((% and 8) shl 1) or ((% and 16) shr 1) or ((% and 32) shr 3) or ((% and 64) shr 5) or ((% and 128) shr 7) end repeat  ;in EAX ;out EAX reverse32: xor edx,edx mov dl,al mov dl, byte[LUT+edx] shr eax,8 mov cl,dl mov dl,al shl ecx,8 mov dl, byte[LUT+edx] shr eax,8 mov cl,dl mov dl,al shl ecx,8 mov dl, byte[LUT+edx] mov cl,dl mov dl,ah shl ecx,8 mov dl, byte[LUT+edx] mov cl,dl mov eax,ecx

Some optimization notes for the above, using or ecx,edx instead of mov cl,dl may be faster on newer processors, also movzx edx,byte[LUT+edx] may aslo be faster instead of mov dl,byte[LUT+edx].

Of course you'd need the added 256bytes for the LUT but the speed gain would be worth while. Depending on the memory constraints a word sized 65536 byte LUT may even be worth it.
16 Mar 2006, 05:16
Ivan2k2

Joined: 08 Sep 2004
Posts: 79
Location: Russia, Angarsk
my algo:

 Code: mov eax,12345678h ;;;;;;;;;;;;;;;;;;;;;;;; mov ebx,eax and eax,0x0f0f0f0f and ebx,0xf0f0f0f0 shl eax,4 shr ebx,4 or  eax,ebx mov ebx,eax and eax,0x33333333 and ebx,0xcccccccc shl eax,2 shr ebx,2 or  eax,ebx mov ebx,eax and eax,0x55555555 and ebx,0xaaaaaaaa shl eax,1 shr ebx,1 or  eax,ebx bswap eax ;;;;;;;;;;;;;;;;;;;;;;;;

it uses only two registers: EAX and EBX
16 Mar 2006, 05:46
Madis731

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
@Ivan2k2:
Why I used more registers was because of the parallelism. With two registers, you can roughly do
clocks_taken=#_of_instructions/2
When assigning two other registers to do the job, you can generally divide it by four. And in your case - you need to drag the SHL/SHR apart because otherwise they just waste your clocks. Because you can't (as I see from the code) you may aswell replace shl1 => add, shl2 => add\add, shl4 => add\add\add and get one or two cycles off.
@r22:
Sorry, I didn't add that my goal was one-pass speed. Look-up tables are usables when the code needed to run through would be larger than the lookup-table. In this case, it wastes about 40-80 clocks to initialize the 8 parts of the table (32 bytes is one stride long). This lookup table isn't the most stable either. In the best case scenario it might contain 0 in the register and init only one 32-byte stride and happily finish in 10 or less clocks. I already investigated in the lookup table behaviour, but I didn't like it because of the lookup tables I thought I'd give a shot without them.
Another point against LUT is porting to, say, 64 bits. Then you need 8xbyte-conversions or 4xword-conversions. If you look at the instructions used - you can see that extending to 64 isn't much of a problem. My only concern it when 8-byte constants are used, the opcodes generated will be too long
16 Mar 2006, 11:38
r22

Joined: 27 Dec 2004
Posts: 805
After some testing, I agree with you Madis, the LUT timings are slightly slower.

I cleaned up YOUR version a little and it currently runs 25% faster than your original.

Heres the snippet from the benchmarking code I use.
 Code: call    [GetCurrentProcess] ;returns -1         push    100h         push    eax         call    [SetPriorityClass]         call    [GetCurrentThread];;returns -2         push    15         push    eax         call    [SetThreadPriority]     push 0     push 0     push 0     push 0     call [MessageBox] ;---------------------------------------     mov esi,07FFFFFFh     call [GetTickCount]     mov edi,eax     jmp tst1 align 16 tst1:     call Reverse1     dec esi     jnz tst1     push eax     push _fmth     call [printf]     call [GetTickCount]     sub eax,edi     push eax     push result1     call [printf]     mov esi,07FFFFFFh     call [GetTickCount]     mov edi,eax     jmp tst2 align 16 tst2:     call Reverse2     dec esi     jnz tst2     push eax     push _fmth     call [printf]     call [GetTickCount]     sub eax,edi     push eax     push result2     call [printf] ;+++++++++++++++++++++++++++++++++++++++++++++      push 0      push buffer      push buffer      push 0      call [MessageBox]      push 0      call [ExitProcess] align 16 Reverse2:         xor     ecx,ecx         mov     eax,12345678h         bswap   eax         mov     ebx,eax         shr     eax,4         shl     ebx,4         and     eax, 0F0F0F0Fh         and     ebx,0F0F0F0F0h         or      ecx,eax         or      ecx,ebx         or      eax,ebx         and     ecx,0CCCCCCCCh         and     eax, 33333333h         shr     ecx,2         shl     eax,2         or      eax,ecx         mov     ebx,eax         and     eax,0AAAAAAAAh         and     ebx, 55555555h         shr     eax,1         shl     ebx,1         or     eax,ebx         ret     0 align 16 Reverse1:         xor     eax,eax         mov     ebx,12345678h         mov     ecx,ebx         mov     edx,ebx         and     ecx,0F0F0F0F0h         and     edx,00F0F0F0Fh         shr     ecx,4         or      eax,ecx         shl     edx,4         or      eax,edx         mov     ebx,eax         mov     ecx,ebx         mov     edx,ebx         and     ecx,0CCCCCCCCh         and     edx,033333333h         shr     ecx,2         mov     eax,ecx         shl     edx,2         or      eax,edx         mov     ebx,eax         mov     ecx,ebx         mov     edx,ebx         and     ecx,0AAAAAAAAh         and     edx,055555555h         shr     ecx,1         mov     eax,ecx         shl     edx,1         or      eax,edx         bswap   eax         ret     0
17 Mar 2006, 04:08
Ivan2k2

Joined: 08 Sep 2004
Posts: 79
Location: Russia, Angarsk
yet another code:

 Code: mov     eax,12345678h          bswap   eax          mov     ebx,eax          shr     eax,4          shl     ebx,4          and     eax, 0F0F0F0Fh          and     ebx,0F0F0F0F0h          ;or      ecx,eax          ;or      ecx,ebx          lea     ecx,[eax+ebx]         ;or      eax,ebx          add     eax,ebx         and     ecx,0CCCCCCCCh          and     eax, 33333333h          shr     ecx,2          lea     eax,[eax*4+ecx]         mov     ebx,eax          and     eax,0AAAAAAAAh          and     ebx, 55555555h          shr     eax,1          lea     eax,[eax+ebx*2]
17 Mar 2006, 06:12
r22

Joined: 27 Dec 2004
Posts: 805
Nice, Ivan, your code is 8% faster on my benchmark. Using lea to combine a few instructions shows some definet speed improvement.

The only bottleneck is
lea ecx,[eax+ebx]
add eax,ebx

I made a few tweaks, and got rid of the EBX so it can be used as a function without worrying about register preservation.

It's only faster by an extra 5% (I just had to one-up you Ivan).

 Code: mov     eax,12345678h         bswap   eax         mov     edx,eax         shl     eax,4         shr     edx,4         and     eax,0F0F0F0F0h         and     edx, 0F0F0F0Fh         or      eax,edx ;;;;;;;;;;;;;;;;;;;;;;;;;;;;         mov     ecx,eax ;;;;;;;;;;;;;;;;;;;;;;;;;;;         and     eax, 33333333h         and     ecx,0CCCCCCCCh         shr     ecx,2         lea     eax,[eax*4+ecx]         mov     edx,eax         shr     eax,1         and     edx,055555555h         and     eax,055555555h         lea     eax,[eax+edx*2]

I'm running an AMD x2 3800+, I'd be interested if its faster or slower on a pentium.

Here's the benchmarking code with the two fastest algorithms
 Code: ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;         call    [GetCurrentProcess] ;returns -1         push    100h         push    eax         call    [SetPriorityClass]         call    [GetCurrentThread];;returns -2         push    15         push    eax         call    [SetThreadPriority]     push 0     push 0     push 0     push 0     call [MessageBox] ;---------------------------------------     mov esi,0bFFFFFFh     call [GetTickCount]     mov edi,eax     jmp tst1 align 16 tst1:     call Reverse1     dec esi     jnz tst1     push eax     push _fmth     call [printf]     call [GetTickCount]     sub eax,edi     push eax     push result1     call [printf]     mov esi,0bFFFFFFh     call [GetTickCount]     mov edi,eax     jmp tst2 align 16 tst2:     call Reverse2     dec esi     jnz tst2     push eax     push _fmth     call [printf]     call [GetTickCount]     sub eax,edi     push eax     push result2     call [printf] ;+++++++++++++++++++++++++++++++++++++++++++++      push 0      push buffer      push buffer      push 0      call [MessageBox]      push 0      call [ExitProcess] align 16 Reverse2:         mov     eax,12345678h         bswap   eax         mov     edx,eax         shl     eax,4         shr     edx,4         and     eax,0F0F0F0F0h         and     edx, 0F0F0F0Fh         or      eax,edx         mov     ecx,eax         and     eax, 33333333h         and     ecx,0CCCCCCCCh         shr     ecx,2         lea     eax,[eax*4+ecx]         mov     edx,eax         shr     eax,1         and     edx,055555555h         and     eax,055555555h         lea     eax,[eax+edx*2]         ret     0 align 16 Reverse1:         mov     eax,12345678h         bswap   eax           mov     ebx,eax           shr     eax,4           shl     ebx,4           and     eax, 0F0F0F0Fh           and     ebx,0F0F0F0F0h           ;or      ecx,eax           ;or      ecx,ebx           lea     ecx,[eax+ebx]          ;or      eax,ebx           add     eax,ebx          and     ecx,0CCCCCCCCh           and     eax, 33333333h           shr     ecx,2           lea     eax,[eax*4+ecx]          mov     ebx,eax           and     eax,0AAAAAAAAh           and     ebx, 55555555h           shr     eax,1           lea     eax,[eax+ebx*2]          ret     0
17 Mar 2006, 07:34
Madis731

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
I tried your code and got the following results:
 Code: Vastus:510274632 Madis's version=1891 s Vastus:510274632 Ivan+r22's try=1906 s Vastus:510274632 ASM GEMS=1594 s

With THIS:
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-2992.png

The last one is from http://www.df.lth.se/~john_e/gems/gem0018.html
There's also a population count algo for my liking there
17 Mar 2006, 09:22
r22

Joined: 27 Dec 2004
Posts: 805
Interesting stuff. Just shows the pitfalls of optimizing algorithms. The ASM Gems runs at the exact timing as Ivan's tweaked version on my AMD cpu but on Madis' Intel cpu the gems version is a good deal faster.

I personally like the ASM Gem's version better, the XOR's instead of a second dword immediate value AND is much more efficient.

Interesting thread Madis
17 Mar 2006, 21:44
RedGhost

Joined: 18 May 2005
Posts: 444
Location: BC, Canada

Madis731 wrote:
I tried your code and got the following results:
 Code: Vastus:510274632 Madis's version=1891 s Vastus:510274632 Ivan+r22's try=1906 s Vastus:510274632 ASM GEMS=1594 s

With THIS:
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-2992.png

The last one is from http://www.df.lth.se/~john_e/gems/gem0018.html
There's also a population count algo for my liking there

off-topic but do you have the pentium4 3.0ghz 64bit with hyper-threading? i think we have the same processor, but mine was 3.01 out of the box (yeah 0.01 ghz for free!)

_________________
redghost.ca
18 Mar 2006, 02:02
Madis731

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
The 64-bit ones should be with 3.066Ghz and CPUz should say anything between 2.9-3.1 so the 10MHz is not anything to worry about. The also should have EIST that sometimes cuts off 200MHz (one ABit MoBo that I tested) on idle and maybe even more like around 1GHz (an example from laptops).

Mine isn't 64-bit as otherwise you should see the EM64T marking on my CPUz screenshot.

Another thing is that isn't your 64-bit CPU dual-core instead of HT?

BTW, r22, what kind of an AMD you have. How's it clocked and what other specs it has?

Any other types of benchmarks?

PS. I will test the instruction sequence at home on my P!!! laptop because I know that the P!!!s are the BEST in the history of computer science
20 Mar 2006, 09:26
Ivan2k2

Joined: 08 Sep 2004
Posts: 79
Location: Russia, Angarsk
Hello again...
Here my new variant of this algo, which is faster (on my CPU) than Madis`s algo
my cpu: p4 celeron 1.8 ghz

 Code: mov eax,0x12345678 mov ebx,eax mov ecx,eax mov edx,ebx and eax,0x88888888 and ecx,0x11111111 and ebx,0x44444444 and edx,0x22222222 shr eax,3 shr ebx,1 lea ebx,[ebx+edx*2] lea eax,[eax+ecx*8] add eax,ebx             bswap eax mov   ebx,eax shr   eax,4  shl   ebx,4  and   eax,0x0f0f0f0f and   ebx,0xf0f0f0f0 add   eax,ebx
20 Mar 2006, 12:03
r22

Joined: 27 Dec 2004
Posts: 805
My cpu, 2ghz AMD x2 3800+

Ivan your new code runs at almost exactly the same speed, as your old code (that I modified) and the ASM Gem's code madis provided the link for.

Running the three in loops of 0FFFFFFFh they only vary by an average of 15ms. With no consistant winner.
21 Mar 2006, 03:38
Ivan2k2

Joined: 08 Sep 2004
Posts: 79
Location: Russia, Angarsk
2 r22: Are you sure that my algo produces wrong results ???

i checked it under debugger many times...
0x12345678 -> 0x1e6a2c48
0x9abcdef0 -> 0x0f7b3d59
0x1e6a2c48 -> 0x12345678
21 Mar 2006, 03:47
r22

Joined: 27 Dec 2004
Posts: 805
It doesn't produce the wrong results it just doesn't run any faster in the benchmark.

The fastest solution for 32bit is still the 16bit LookUpTable
 Code: mov eax,12345678h         mov ebx,RevLUT         movzx ecx,ax         shr eax,16         movzx edx,word[ebx+ecx*2]         movzx eax,word[ebx+eax*2]         shl edx,16         or eax,edx

Runs 30% faster than any of the other algorithms.

Timings 0FFFFFFFh iterations
LUT Version 937ms
ThreeFastestNonLUT 1343-1360ms

The LUT costs the most memory 65536 words worth, but if top speed is what you're after it can't be beat for 32bit (It'll scale to 64bit as well as the 1byte LUT did in 32bit).

I wonder if there's an obscure floating point solution, like
mov eax,12345678h
fld eax
fmul [SomeMagicNumber]
fstp [esp-8]
mov eax,[esp-5?]
21 Mar 2006, 04:45
Madis731

Joined: 25 Sep 2003
Posts: 2146
Location: Estonia
There might be - I'm already studying it. I saw some logic in the:
0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F => 0,8,4,C,2,A,6,E,1,9,5,D,3,B,7,F

Its like this:
1st, 9th, 5th, 13th, 3rd, 11th, 7th, 15th, 2nd, 10th, 6th, 14th, 4th, 12th, 8th, 16th.

1 +8+12+8+6+8+12+8+6+8+12+8+6+etc.
taking N%16 of the result

You can get something like:
Xn=1+N/4*34+(8 if N && 1)+(20 if N && 2)

I think they can be done with conditional moves

if both bit 1 and 2 are up, then it combines to 3 and 28 is added.

Now I'm gonna go and assemble this

PS. above 8 you need to subtract 3 from the answer
21 Mar 2006, 08:12
Ivan2k2

Joined: 08 Sep 2004
Posts: 79
Location: Russia, Angarsk
hmm...
2 Madis: i'll try your idea too.... =)))))
2 r22: maybe my code runs faster on intel p4 ???
21 Mar 2006, 10:39
RedGhost

Joined: 18 May 2005
Posts: 444
Location: BC, Canada
 Madis731 wrote: The 64-bit ones should be with 3.066Ghz and CPUz should say anything between 2.9-3.1 so the 10MHz is not anything to worry about. The also should have EIST that sometimes cuts off 200MHz (one ABit MoBo that I tested) on idle and maybe even more like around 1GHz (an example from laptops). Mine isn't 64-bit as otherwise you should see the EM64T marking on my CPUz screenshot. Another thing is that isn't your 64-bit CPU dual-core instead of HT? BTW, r22, what kind of an AMD you have. How's it clocked and what other specs it has? Any other types of benchmarks? PS. I will test the instruction sequence at home on my P!!! laptop because I know that the P!!!s are the BEST in the history of computer science

ah didnt think EM64T would be listed

nope, single-core, just with HT

i remember my old p!!! 733mhz, and when i got it was the best out at the time and all my friends were jealous haha, i probably couldn't sell it for \$20 now >_>

_________________
redghost.ca
21 Mar 2006, 12:03
Ivan2k2

Joined: 08 Sep 2004
Posts: 79
Location: Russia, Angarsk
challenge continues... =)

next code a little faster than any last algos without LUT (on intel p4 celeron 1.8 )
what about amd ???
 Code: mov   eax,0x1e6a2c48 mov   ecx,eax and   eax,0x33333333 and   ecx,0xcccccccc mov   ebx,eax mov   edx,ecx lea   eax,[eax*4+ebx] and   eax,0x66666666 shr   ecx,2 add   ecx,edx and   ecx,0x66666666 shr   ecx,1 lea   eax,[eax*2+ecx] bswap eax mov   ebx,eax shr   eax,4  shl   ebx,4  and   eax,0x0f0f0f0f and   ebx,0xf0f0f0f0 add   eax,ebx
22 Mar 2006, 09:23
r22

Joined: 27 Dec 2004
Posts: 805
It's 25% slower than your previous algorithm Ivan, on my AMD box.
This is very odd. The timings are
~1343 ms for the non LUT algos
1734 ms for your newest one.

I suspect this block of code is the cause
shr ecx,2
add ecx,edx
and ecx,0x66666666
shr ecx,1
Four instructions writing/modifying the same register in a row is causing a partial stale bottleneck. For some reason the Pentium optimizes around this bottle neck while the AMD processor seems to choke on it.

This really makes you think about the validity of instruction level optimizations when the code can run on two totally different performing processors.
22 Mar 2006, 18:18
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------Blog General----------------MainDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsCompiler InternalsIDE DevelopmentOS ConstructionNon-x86 architecturesHigh Level LanguagesProgramming Language DesignProjects and IdeasExamples and Tutorials Other----------------FeedbackHeapTest Area
Goto page 1, 2  Next

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum

Copyright © 2004-2018, Tomasz Grysztar.
Powered by rwasa.