flat assembler
Message board for the users of flat assembler.

 Index > Main > Another one of these algos, this time: reverse_bits_32() Goto page 1, 2  Next
Author

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Code:
```        xor     eax,eax         ;p01   T=1
mov     ebx,12345678h   ;p01                    ;The constant that needs to be reversed

mov     ecx,ebx         ;p01   T=2
mov     edx,ebx         ;p01
and     ecx,0F0F0F0F0h  ;p01   T=3
and     edx,00F0F0F0Fh  ;p01
shr     ecx,4           ;p0    T=4
or      eax,ecx         ;p01
shl     edx,4           ;p0    T=5
or      eax,edx         ;p01

mov     ebx,eax         ;p01   T=6
mov     ecx,ebx         ;p01
mov     edx,ebx         ;p01   T=7
and     ecx,0CCCCCCCCh  ;p01
and     edx,033333333h  ;p01   T=8
shr     ecx,2           ;p0
mov     eax,ecx         ;p01   T=9
shl     edx,2           ;p0
or      eax,edx         ;p01   T=10

mov     ebx,eax         ;p01
mov     ecx,ebx         ;p01   T=11
mov     edx,ebx         ;p01
and     ecx,0AAAAAAAAh  ;p01   T=12
and     edx,055555555h  ;p01
shr     ecx,1           ;p0    T=13
mov     eax,ecx         ;p01
shl     edx,1           ;p0    T=14
or      eax,edx         ;p01

bswap   eax             ;p0+01 T=15-16
;117 clocks on a P!!! <= WHY?
;7-8 clocks on a P4 (PNI) <= This is what I expected from P!!!
```

_________________
My updated idol http://www.agner.org/optimize/
15 Mar 2006, 07:50
r22

Joined: 27 Dec 2004
Posts: 805
r22 16 Mar 2006, 05:16
Wouldn't running each byte through a lookup table be a little faster?.
Code:
```LUT:
db 0
repeat 255
db ((% and 1) shl 7) or ((% and 2) shl 5) or ((% and 4) shl 3) or ((% and 8) shl 1) or ((% and 16) shr 1) or ((% and 32) shr 3) or ((% and 64) shr 5) or ((% and 128) shr 7)
end repeat

;in EAX
;out EAX
reverse32:
xor edx,edx
mov dl,al
mov dl, byte[LUT+edx]
shr eax,8
mov cl,dl
mov dl,al
shl ecx,8
mov dl, byte[LUT+edx]
shr eax,8
mov cl,dl
mov dl,al
shl ecx,8
mov dl, byte[LUT+edx]
mov cl,dl
mov dl,ah
shl ecx,8
mov dl, byte[LUT+edx]
mov cl,dl
mov eax,ecx
```

Some optimization notes for the above, using or ecx,edx instead of mov cl,dl may be faster on newer processors, also movzx edx,byte[LUT+edx] may aslo be faster instead of mov dl,byte[LUT+edx].

Of course you'd need the added 256bytes for the LUT but the speed gain would be worth while. Depending on the memory constraints a word sized 65536 byte LUT may even be worth it.
16 Mar 2006, 05:16
Ivan2k2

Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2 16 Mar 2006, 05:46
my algo:

Code:
```
mov eax,12345678h
;;;;;;;;;;;;;;;;;;;;;;;;
mov ebx,eax
and eax,0x0f0f0f0f
and ebx,0xf0f0f0f0

shl eax,4
shr ebx,4
or  eax,ebx

mov ebx,eax
and eax,0x33333333
and ebx,0xcccccccc

shl eax,2
shr ebx,2
or  eax,ebx

mov ebx,eax
and eax,0x55555555
and ebx,0xaaaaaaaa

shl eax,1
shr ebx,1
or  eax,ebx

bswap eax
;;;;;;;;;;;;;;;;;;;;;;;;

```

it uses only two registers: EAX and EBX
16 Mar 2006, 05:46

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
@Ivan2k2:
Why I used more registers was because of the parallelism. With two registers, you can roughly do
clocks_taken=#_of_instructions/2
When assigning two other registers to do the job, you can generally divide it by four. And in your case - you need to drag the SHL/SHR apart because otherwise they just waste your clocks. Because you can't (as I see from the code) you may aswell replace shl1 => add, shl2 => add\add, shl4 => add\add\add and get one or two cycles off.
@r22:
Sorry, I didn't add that my goal was one-pass speed. Look-up tables are usables when the code needed to run through would be larger than the lookup-table. In this case, it wastes about 40-80 clocks to initialize the 8 parts of the table (32 bytes is one stride long). This lookup table isn't the most stable either. In the best case scenario it might contain 0 in the register and init only one 32-byte stride and happily finish in 10 or less clocks. I already investigated in the lookup table behaviour, but I didn't like it because of the lookup tables I thought I'd give a shot without them.
Another point against LUT is porting to, say, 64 bits. Then you need 8xbyte-conversions or 4xword-conversions. If you look at the instructions used - you can see that extending to 64 isn't much of a problem. My only concern it when 8-byte constants are used, the opcodes generated will be too long
16 Mar 2006, 11:38
r22

Joined: 27 Dec 2004
Posts: 805
r22 17 Mar 2006, 04:08
After some testing, I agree with you Madis, the LUT timings are slightly slower.

I cleaned up YOUR version a little and it currently runs 25% faster than your original.

Heres the snippet from the benchmarking code I use.
Code:
```        call    [GetCurrentProcess] ;returns -1
push    100h
push    eax
call    [SetPriorityClass]
push    15
push    eax

push 0
push 0
push 0
push 0
call [MessageBox]
;---------------------------------------
mov esi,07FFFFFFh
call [GetTickCount]
mov edi,eax
jmp tst1
align 16
tst1:
call Reverse1

dec esi
jnz tst1
push eax
push _fmth
call [printf]
call [GetTickCount]
sub eax,edi
push eax
push result1
call [printf]

mov esi,07FFFFFFh
call [GetTickCount]
mov edi,eax
jmp tst2
align 16
tst2:
call Reverse2

dec esi
jnz tst2
push eax
push _fmth
call [printf]
call [GetTickCount]
sub eax,edi
push eax
push result2
call [printf]
;+++++++++++++++++++++++++++++++++++++++++++++
push 0
push buffer
push buffer
push 0
call [MessageBox]
push 0
call [ExitProcess]

align 16
Reverse2:
xor     ecx,ecx
mov     eax,12345678h
bswap   eax
mov     ebx,eax
shr     eax,4
shl     ebx,4
and     eax, 0F0F0F0Fh
and     ebx,0F0F0F0F0h
or      ecx,eax
or      ecx,ebx
or      eax,ebx
and     ecx,0CCCCCCCCh
and     eax, 33333333h
shr     ecx,2
shl     eax,2
or      eax,ecx
mov     ebx,eax
and     eax,0AAAAAAAAh
and     ebx, 55555555h
shr     eax,1
shl     ebx,1
or     eax,ebx
ret     0

align 16
Reverse1:
xor     eax,eax
mov     ebx,12345678h
mov     ecx,ebx
mov     edx,ebx
and     ecx,0F0F0F0F0h
and     edx,00F0F0F0Fh
shr     ecx,4
or      eax,ecx
shl     edx,4
or      eax,edx
mov     ebx,eax
mov     ecx,ebx
mov     edx,ebx
and     ecx,0CCCCCCCCh
and     edx,033333333h
shr     ecx,2
mov     eax,ecx
shl     edx,2
or      eax,edx
mov     ebx,eax
mov     ecx,ebx
mov     edx,ebx
and     ecx,0AAAAAAAAh
and     edx,055555555h
shr     ecx,1
mov     eax,ecx
shl     edx,1
or      eax,edx
bswap   eax
ret     0
```
17 Mar 2006, 04:08
Ivan2k2

Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2 17 Mar 2006, 06:12
yet another code:

Code:
```        mov     eax,12345678h
bswap   eax
mov     ebx,eax
shr     eax,4
shl     ebx,4
and     eax, 0F0F0F0Fh
and     ebx,0F0F0F0F0h
;or      ecx,eax
;or      ecx,ebx
lea     ecx,[eax+ebx]
;or      eax,ebx
and     ecx,0CCCCCCCCh
and     eax, 33333333h
shr     ecx,2
lea     eax,[eax*4+ecx]
mov     ebx,eax
and     eax,0AAAAAAAAh
and     ebx, 55555555h
shr     eax,1
lea     eax,[eax+ebx*2]
```
17 Mar 2006, 06:12
r22

Joined: 27 Dec 2004
Posts: 805
r22 17 Mar 2006, 07:34
Nice, Ivan, your code is 8% faster on my benchmark. Using lea to combine a few instructions shows some definet speed improvement.

The only bottleneck is
lea ecx,[eax+ebx]

I made a few tweaks, and got rid of the EBX so it can be used as a function without worrying about register preservation.

It's only faster by an extra 5% (I just had to one-up you Ivan).

Code:
```        mov     eax,12345678h
bswap   eax
mov     edx,eax
shl     eax,4
shr     edx,4
and     eax,0F0F0F0F0h
and     edx, 0F0F0F0Fh
or      eax,edx ;;;;;;;;;;;;;;;;;;;;;;;;;;;;
mov     ecx,eax ;;;;;;;;;;;;;;;;;;;;;;;;;;;
and     eax, 33333333h
and     ecx,0CCCCCCCCh
shr     ecx,2
lea     eax,[eax*4+ecx]
mov     edx,eax
shr     eax,1
and     edx,055555555h
and     eax,055555555h
lea     eax,[eax+edx*2]
```

I'm running an AMD x2 3800+, I'd be interested if its faster or slower on a pentium.

Here's the benchmarking code with the two fastest algorithms
Code:
```;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
call    [GetCurrentProcess] ;returns -1
push    100h
push    eax
call    [SetPriorityClass]
push    15
push    eax

push 0
push 0
push 0
push 0
call [MessageBox]
;---------------------------------------
mov esi,0bFFFFFFh
call [GetTickCount]
mov edi,eax
jmp tst1
align 16
tst1:
call Reverse1

dec esi
jnz tst1
push eax
push _fmth
call [printf]
call [GetTickCount]
sub eax,edi
push eax
push result1
call [printf]

mov esi,0bFFFFFFh
call [GetTickCount]
mov edi,eax
jmp tst2
align 16
tst2:
call Reverse2

dec esi
jnz tst2
push eax
push _fmth
call [printf]
call [GetTickCount]
sub eax,edi
push eax
push result2
call [printf]
;+++++++++++++++++++++++++++++++++++++++++++++
push 0
push buffer
push buffer
push 0
call [MessageBox]
push 0
call [ExitProcess]

align 16
Reverse2:
mov     eax,12345678h
bswap   eax
mov     edx,eax
shl     eax,4
shr     edx,4
and     eax,0F0F0F0F0h
and     edx, 0F0F0F0Fh
or      eax,edx
mov     ecx,eax
and     eax, 33333333h
and     ecx,0CCCCCCCCh
shr     ecx,2
lea     eax,[eax*4+ecx]
mov     edx,eax
shr     eax,1
and     edx,055555555h
and     eax,055555555h
lea     eax,[eax+edx*2]
ret     0

align 16
Reverse1:
mov     eax,12345678h
bswap   eax
mov     ebx,eax
shr     eax,4
shl     ebx,4
and     eax, 0F0F0F0Fh
and     ebx,0F0F0F0F0h
;or      ecx,eax
;or      ecx,ebx
lea     ecx,[eax+ebx]
;or      eax,ebx
and     ecx,0CCCCCCCCh
and     eax, 33333333h
shr     ecx,2
lea     eax,[eax*4+ecx]
mov     ebx,eax
and     eax,0AAAAAAAAh
and     ebx, 55555555h
shr     eax,1
lea     eax,[eax+ebx*2]
ret     0
```
17 Mar 2006, 07:34

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
I tried your code and got the following results:
Code:
```Vastus:510274632

Vastus:510274632
Ivan+r22's try=1906 s

Vastus:510274632
ASM GEMS=1594 s
```

With THIS:
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-2992.png

The last one is from http://www.df.lth.se/~john_e/gems/gem0018.html
There's also a population count algo for my liking there
17 Mar 2006, 09:22
r22

Joined: 27 Dec 2004
Posts: 805
r22 17 Mar 2006, 21:44
Interesting stuff. Just shows the pitfalls of optimizing algorithms. The ASM Gems runs at the exact timing as Ivan's tweaked version on my AMD cpu but on Madis' Intel cpu the gems version is a good deal faster.

I personally like the ASM Gem's version better, the XOR's instead of a second dword immediate value AND is much more efficient.

17 Mar 2006, 21:44
RedGhost

Joined: 18 May 2005
Posts: 443
RedGhost 18 Mar 2006, 02:02
I tried your code and got the following results:
Code:
```Vastus:510274632

Vastus:510274632
Ivan+r22's try=1906 s

Vastus:510274632
ASM GEMS=1594 s
```

With THIS:
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-2992.png

The last one is from http://www.df.lth.se/~john_e/gems/gem0018.html
There's also a population count algo for my liking there

off-topic but do you have the pentium4 3.0ghz 64bit with hyper-threading? i think we have the same processor, but mine was 3.01 out of the box (yeah 0.01 ghz for free!)

_________________
redghost.ca
18 Mar 2006, 02:02

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
The 64-bit ones should be with 3.066Ghz and CPUz should say anything between 2.9-3.1 so the 10MHz is not anything to worry about. The also should have EIST that sometimes cuts off 200MHz (one ABit MoBo that I tested) on idle and maybe even more like around 1GHz (an example from laptops).

Mine isn't 64-bit as otherwise you should see the EM64T marking on my CPUz screenshot.

Another thing is that isn't your 64-bit CPU dual-core instead of HT?

BTW, r22, what kind of an AMD you have. How's it clocked and what other specs it has?

Any other types of benchmarks?

PS. I will test the instruction sequence at home on my P!!! laptop because I know that the P!!!s are the BEST in the history of computer science
20 Mar 2006, 09:26
Ivan2k2

Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2 20 Mar 2006, 12:03
Hello again...
Here my new variant of this algo, which is faster (on my CPU) than Madis`s algo
my cpu: p4 celeron 1.8 ghz

Code:
```mov eax,0x12345678
mov ebx,eax
mov ecx,eax
mov edx,ebx
and eax,0x88888888
and ecx,0x11111111
and ebx,0x44444444
and edx,0x22222222
shr eax,3
shr ebx,1
lea ebx,[ebx+edx*2]
lea eax,[eax+ecx*8]

bswap eax
mov   ebx,eax
shr   eax,4
shl   ebx,4
and   eax,0x0f0f0f0f
and   ebx,0xf0f0f0f0

```
20 Mar 2006, 12:03
r22

Joined: 27 Dec 2004
Posts: 805
r22 21 Mar 2006, 03:38
My cpu, 2ghz AMD x2 3800+

Ivan your new code runs at almost exactly the same speed, as your old code (that I modified) and the ASM Gem's code madis provided the link for.

Running the three in loops of 0FFFFFFFh they only vary by an average of 15ms. With no consistant winner.
21 Mar 2006, 03:38
Ivan2k2

Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2 21 Mar 2006, 03:47
2 r22: Are you sure that my algo produces wrong results ???

i checked it under debugger many times...
0x12345678 -> 0x1e6a2c48
0x9abcdef0 -> 0x0f7b3d59
0x1e6a2c48 -> 0x12345678
21 Mar 2006, 03:47
r22

Joined: 27 Dec 2004
Posts: 805
r22 21 Mar 2006, 04:45
It doesn't produce the wrong results it just doesn't run any faster in the benchmark.

The fastest solution for 32bit is still the 16bit LookUpTable
Code:
```        mov eax,12345678h
mov ebx,RevLUT
movzx ecx,ax
shr eax,16
movzx edx,word[ebx+ecx*2]
movzx eax,word[ebx+eax*2]
shl edx,16
or eax,edx
```

Runs 30% faster than any of the other algorithms.

Timings 0FFFFFFFh iterations
LUT Version 937ms
ThreeFastestNonLUT 1343-1360ms

The LUT costs the most memory 65536 words worth, but if top speed is what you're after it can't be beat for 32bit (It'll scale to 64bit as well as the 1byte LUT did in 32bit).

I wonder if there's an obscure floating point solution, like
mov eax,12345678h
fld eax
fmul [SomeMagicNumber]
fstp [esp-8]
mov eax,[esp-5?]
21 Mar 2006, 04:45

Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
There might be - I'm already studying it. I saw some logic in the:
0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F => 0,8,4,C,2,A,6,E,1,9,5,D,3,B,7,F

Its like this:
1st, 9th, 5th, 13th, 3rd, 11th, 7th, 15th, 2nd, 10th, 6th, 14th, 4th, 12th, 8th, 16th.

1 +8+12+8+6+8+12+8+6+8+12+8+6+etc.
taking N%16 of the result

You can get something like:
Xn=1+N/4*34+(8 if N && 1)+(20 if N && 2)

I think they can be done with conditional moves

if both bit 1 and 2 are up, then it combines to 3 and 28 is added.

Now I'm gonna go and assemble this

PS. above 8 you need to subtract 3 from the answer
21 Mar 2006, 08:12
Ivan2k2

Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2 21 Mar 2006, 10:39
hmm...
2 r22: maybe my code runs faster on intel p4 ???
21 Mar 2006, 10:39
RedGhost

Joined: 18 May 2005
Posts: 443
RedGhost 21 Mar 2006, 12:03
The 64-bit ones should be with 3.066Ghz and CPUz should say anything between 2.9-3.1 so the 10MHz is not anything to worry about. The also should have EIST that sometimes cuts off 200MHz (one ABit MoBo that I tested) on idle and maybe even more like around 1GHz (an example from laptops).

Mine isn't 64-bit as otherwise you should see the EM64T marking on my CPUz screenshot.

Another thing is that isn't your 64-bit CPU dual-core instead of HT?

BTW, r22, what kind of an AMD you have. How's it clocked and what other specs it has?

Any other types of benchmarks?

PS. I will test the instruction sequence at home on my P!!! laptop because I know that the P!!!s are the BEST in the history of computer science

ah didnt think EM64T would be listed

nope, single-core, just with HT

i remember my old p!!! 733mhz, and when i got it was the best out at the time and all my friends were jealous haha, i probably couldn't sell it for \$20 now >_>

_________________
redghost.ca
21 Mar 2006, 12:03
Ivan2k2

Joined: 08 Sep 2004
Posts: 80
Location: Russia, Angarsk
Ivan2k2 22 Mar 2006, 09:23
challenge continues... =)

next code a little faster than any last algos without LUT (on intel p4 celeron 1.8 )
Code:
```mov   eax,0x1e6a2c48
mov   ecx,eax
and   eax,0x33333333
and   ecx,0xcccccccc
mov   ebx,eax
mov   edx,ecx
lea   eax,[eax*4+ebx]
and   eax,0x66666666

shr   ecx,2
and   ecx,0x66666666
shr   ecx,1
lea   eax,[eax*2+ecx]

bswap eax
mov   ebx,eax
shr   eax,4
shl   ebx,4
and   eax,0x0f0f0f0f
and   ebx,0xf0f0f0f0
```
22 Mar 2006, 09:23
r22

Joined: 27 Dec 2004
Posts: 805
r22 22 Mar 2006, 18:18
It's 25% slower than your previous algorithm Ivan, on my AMD box.
This is very odd. The timings are
~1343 ms for the non LUT algos

I suspect this block of code is the cause
shr ecx,2
and ecx,0x66666666
shr ecx,1
Four instructions writing/modifying the same register in a row is causing a partial stale bottleneck. For some reason the Pentium optimizes around this bottle neck while the AMD processor seems to choke on it.

This really makes you think about the validity of instruction level optimizations when the code can run on two totally different performing processors.
22 Mar 2006, 18:18
 Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First

 Jump to: Select a forum Official----------------AssemblyPeripheria General----------------MainTutorials and ExamplesDOSWindowsLinuxUnixMenuetOS Specific----------------MacroinstructionsOS ConstructionIDE DevelopmentProjects and IdeasNon-x86 architecturesHigh Level LanguagesProgramming Language DesignCompiler Internals Other----------------FeedbackHeapTest Area
Goto page 1, 2  Next

Forum Rules:
 You cannot post new topics in this forumYou cannot reply to topics in this forumYou cannot edit your posts in this forumYou cannot delete your posts in this forumYou cannot vote in polls in this forumYou cannot attach files in this forumYou can download files in this forum