flat assembler
Message board for the users of flat assembler.
  
|  Index
      > Main > Another one of these algos, this time: reverse_bits_32() Goto page 1, 2 Next | 
| Author | 
 | 
| r22 16 Mar 2006, 05:16 Wouldn't running each byte through a lookup table be a little faster?.
 Code: LUT: db 0 repeat 255 db ((% and 1) shl 7) or ((% and 2) shl 5) or ((% and 4) shl 3) or ((% and 8) shl 1) or ((% and 16) shr 1) or ((% and 32) shr 3) or ((% and 64) shr 5) or ((% and 128) shr 7) end repeat ;in EAX ;out EAX reverse32: xor edx,edx mov dl,al mov dl, byte[LUT+edx] shr eax,8 mov cl,dl mov dl,al shl ecx,8 mov dl, byte[LUT+edx] shr eax,8 mov cl,dl mov dl,al shl ecx,8 mov dl, byte[LUT+edx] mov cl,dl mov dl,ah shl ecx,8 mov dl, byte[LUT+edx] mov cl,dl mov eax,ecx Some optimization notes for the above, using or ecx,edx instead of mov cl,dl may be faster on newer processors, also movzx edx,byte[LUT+edx] may aslo be faster instead of mov dl,byte[LUT+edx]. Of course you'd need the added 256bytes for the LUT but the speed gain would be worth while. Depending on the memory constraints a word sized 65536 byte LUT may even be worth it. | |||
|  16 Mar 2006, 05:16 | 
 | 
| Ivan2k2 16 Mar 2006, 05:46 my algo:
 Code: mov eax,12345678h ;;;;;;;;;;;;;;;;;;;;;;;; mov ebx,eax and eax,0x0f0f0f0f and ebx,0xf0f0f0f0 shl eax,4 shr ebx,4 or eax,ebx mov ebx,eax and eax,0x33333333 and ebx,0xcccccccc shl eax,2 shr ebx,2 or eax,ebx mov ebx,eax and eax,0x55555555 and ebx,0xaaaaaaaa shl eax,1 shr ebx,1 or eax,ebx bswap eax ;;;;;;;;;;;;;;;;;;;;;;;; it uses only two registers: EAX and EBX | |||
|  16 Mar 2006, 05:46 | 
 | 
| Madis731 16 Mar 2006, 11:38 @Ivan2k2:
 Why I used more registers was because of the parallelism. With two registers, you can roughly do clocks_taken=#_of_instructions/2 When assigning two other registers to do the job, you can generally divide it by four. And in your case - you need to drag the SHL/SHR apart because otherwise they just waste your clocks. Because you can't (as I see from the code) you may aswell replace shl1 => add, shl2 => add\add, shl4 => add\add\add and get one or two cycles off.   @r22: Sorry, I didn't add that my goal was one-pass speed. Look-up tables are usables when the code needed to run through would be larger than the lookup-table. In this case, it wastes about 40-80 clocks to initialize the 8 parts of the table (32 bytes is one stride long). This lookup table isn't the most stable either. In the best case scenario it might contain 0 in the register and init only one 32-byte stride and happily finish in 10 or less clocks. I already investigated in the lookup table behaviour, but I didn't like it because of the lookup tables  I thought I'd give a shot without them. Another point against LUT is porting to, say, 64 bits. Then you need 8xbyte-conversions or 4xword-conversions. If you look at the instructions used - you can see that extending to 64 isn't much of a problem. My only concern it when 8-byte constants are used, the opcodes generated will be too long  | |||
|  16 Mar 2006, 11:38 | 
 | 
| r22 17 Mar 2006, 04:08 After some testing, I agree with you Madis, the LUT timings are slightly slower.
 I cleaned up YOUR version a little and it currently runs 25% faster than your original. Heres the snippet from the benchmarking code I use. Code: call [GetCurrentProcess] ;returns -1 push 100h push eax call [SetPriorityClass] call [GetCurrentThread];;returns -2 push 15 push eax call [SetThreadPriority] push 0 push 0 push 0 push 0 call [MessageBox] ;--------------------------------------- mov esi,07FFFFFFh call [GetTickCount] mov edi,eax jmp tst1 align 16 tst1: call Reverse1 dec esi jnz tst1 push eax push _fmth call [printf] call [GetTickCount] sub eax,edi push eax push result1 call [printf] mov esi,07FFFFFFh call [GetTickCount] mov edi,eax jmp tst2 align 16 tst2: call Reverse2 dec esi jnz tst2 push eax push _fmth call [printf] call [GetTickCount] sub eax,edi push eax push result2 call [printf] ;+++++++++++++++++++++++++++++++++++++++++++++ push 0 push buffer push buffer push 0 call [MessageBox] push 0 call [ExitProcess] align 16 Reverse2: xor ecx,ecx mov eax,12345678h bswap eax mov ebx,eax shr eax,4 shl ebx,4 and eax, 0F0F0F0Fh and ebx,0F0F0F0F0h or ecx,eax or ecx,ebx or eax,ebx and ecx,0CCCCCCCCh and eax, 33333333h shr ecx,2 shl eax,2 or eax,ecx mov ebx,eax and eax,0AAAAAAAAh and ebx, 55555555h shr eax,1 shl ebx,1 or eax,ebx ret 0 align 16 Reverse1: xor eax,eax mov ebx,12345678h mov ecx,ebx mov edx,ebx and ecx,0F0F0F0F0h and edx,00F0F0F0Fh shr ecx,4 or eax,ecx shl edx,4 or eax,edx mov ebx,eax mov ecx,ebx mov edx,ebx and ecx,0CCCCCCCCh and edx,033333333h shr ecx,2 mov eax,ecx shl edx,2 or eax,edx mov ebx,eax mov ecx,ebx mov edx,ebx and ecx,0AAAAAAAAh and edx,055555555h shr ecx,1 mov eax,ecx shl edx,1 or eax,edx bswap eax ret 0 | |||
|  17 Mar 2006, 04:08 | 
 | 
| Ivan2k2 17 Mar 2006, 06:12 yet another code:
 Code: mov eax,12345678h bswap eax mov ebx,eax shr eax,4 shl ebx,4 and eax, 0F0F0F0Fh and ebx,0F0F0F0F0h ;or ecx,eax ;or ecx,ebx lea ecx,[eax+ebx] ;or eax,ebx add eax,ebx and ecx,0CCCCCCCCh and eax, 33333333h shr ecx,2 lea eax,[eax*4+ecx] mov ebx,eax and eax,0AAAAAAAAh and ebx, 55555555h shr eax,1 lea eax,[eax+ebx*2] | |||
|  17 Mar 2006, 06:12 | 
 | 
| r22 17 Mar 2006, 07:34 Nice, Ivan, your code is 8% faster on my benchmark. Using lea to combine a few instructions shows some definet speed improvement.
 The only bottleneck is lea ecx,[eax+ebx] add eax,ebx I made a few tweaks, and got rid of the EBX so it can be used as a function without worrying about register preservation. It's only faster by an extra 5% (I just had to one-up you Ivan). Code: mov eax,12345678h bswap eax mov edx,eax shl eax,4 shr edx,4 and eax,0F0F0F0F0h and edx, 0F0F0F0Fh or eax,edx ;;;;;;;;;;;;;;;;;;;;;;;;;;;; mov ecx,eax ;;;;;;;;;;;;;;;;;;;;;;;;;;; and eax, 33333333h and ecx,0CCCCCCCCh shr ecx,2 lea eax,[eax*4+ecx] mov edx,eax shr eax,1 and edx,055555555h and eax,055555555h lea eax,[eax+edx*2] I'm running an AMD x2 3800+, I'd be interested if its faster or slower on a pentium. Here's the benchmarking code with the two fastest algorithms Code: ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; call [GetCurrentProcess] ;returns -1 push 100h push eax call [SetPriorityClass] call [GetCurrentThread];;returns -2 push 15 push eax call [SetThreadPriority] push 0 push 0 push 0 push 0 call [MessageBox] ;--------------------------------------- mov esi,0bFFFFFFh call [GetTickCount] mov edi,eax jmp tst1 align 16 tst1: call Reverse1 dec esi jnz tst1 push eax push _fmth call [printf] call [GetTickCount] sub eax,edi push eax push result1 call [printf] mov esi,0bFFFFFFh call [GetTickCount] mov edi,eax jmp tst2 align 16 tst2: call Reverse2 dec esi jnz tst2 push eax push _fmth call [printf] call [GetTickCount] sub eax,edi push eax push result2 call [printf] ;+++++++++++++++++++++++++++++++++++++++++++++ push 0 push buffer push buffer push 0 call [MessageBox] push 0 call [ExitProcess] align 16 Reverse2: mov eax,12345678h bswap eax mov edx,eax shl eax,4 shr edx,4 and eax,0F0F0F0F0h and edx, 0F0F0F0Fh or eax,edx mov ecx,eax and eax, 33333333h and ecx,0CCCCCCCCh shr ecx,2 lea eax,[eax*4+ecx] mov edx,eax shr eax,1 and edx,055555555h and eax,055555555h lea eax,[eax+edx*2] ret 0 align 16 Reverse1: mov eax,12345678h bswap eax mov ebx,eax shr eax,4 shl ebx,4 and eax, 0F0F0F0Fh and ebx,0F0F0F0F0h ;or ecx,eax ;or ecx,ebx lea ecx,[eax+ebx] ;or eax,ebx add eax,ebx and ecx,0CCCCCCCCh and eax, 33333333h shr ecx,2 lea eax,[eax*4+ecx] mov ebx,eax and eax,0AAAAAAAAh and ebx, 55555555h shr eax,1 lea eax,[eax+ebx*2] ret 0 | |||
|  17 Mar 2006, 07:34 | 
 | 
| Madis731 17 Mar 2006, 09:22 I tried your code and got the following results:
 Code: Vastus:510274632 Madis's version=1891 s Vastus:510274632 Ivan+r22's try=1906 s Vastus:510274632 ASM GEMS=1594 s With THIS: http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-2992.png The last one is from http://www.df.lth.se/~john_e/gems/gem0018.html There's also a population count algo for my liking there  | |||
|  17 Mar 2006, 09:22 | 
 | 
| r22 17 Mar 2006, 21:44 Interesting stuff. Just shows the pitfalls of optimizing algorithms. The ASM Gems runs at the exact timing as Ivan's tweaked version on my AMD cpu but on Madis' Intel cpu the gems version is a good deal faster.
 I personally like the ASM Gem's version better, the XOR's instead of a second dword immediate value AND is much more efficient. Interesting thread Madis | |||
|  17 Mar 2006, 21:44 | 
 | 
| RedGhost 18 Mar 2006, 02:02 Madis731 wrote: I tried your code and got the following results: off-topic but do you have the pentium4 3.0ghz 64bit with hyper-threading? i think we have the same processor, but mine was 3.01 out of the box (yeah 0.01 ghz for free!) _________________ redghost.ca | |||
|  18 Mar 2006, 02:02 | 
 | 
| Madis731 20 Mar 2006, 09:26 The 64-bit ones should be with 3.066Ghz and CPUz should say anything between 2.9-3.1 so the 10MHz is not anything to worry about. The also should have EIST that sometimes cuts off 200MHz (one ABit MoBo that I tested) on idle and maybe even more like around 1GHz (an example from laptops).
 Mine isn't 64-bit as otherwise you should see the EM64T marking on my CPUz screenshot. Another thing is that isn't your 64-bit CPU dual-core instead of HT? BTW, r22, what kind of an AMD you have. How's it clocked and what other specs it has? Any other types of benchmarks? PS. I will test the instruction sequence at home on my P!!! laptop because I know that the P!!!s are the BEST in the history of computer science  | |||
|  20 Mar 2006, 09:26 | 
 | 
| Ivan2k2 20 Mar 2006, 12:03 Hello again...
 Here my new variant of this algo, which is faster (on my CPU) than Madis`s algo my cpu: p4 celeron 1.8 ghz Code: mov eax,0x12345678 mov ebx,eax mov ecx,eax mov edx,ebx and eax,0x88888888 and ecx,0x11111111 and ebx,0x44444444 and edx,0x22222222 shr eax,3 shr ebx,1 lea ebx,[ebx+edx*2] lea eax,[eax+ecx*8] add eax,ebx bswap eax mov ebx,eax shr eax,4 shl ebx,4 and eax,0x0f0f0f0f and ebx,0xf0f0f0f0 add eax,ebx | |||
|  20 Mar 2006, 12:03 | 
 | 
| r22 21 Mar 2006, 03:38 My cpu, 2ghz AMD x2 3800+
 Ivan your new code runs at almost exactly the same speed, as your old code (that I modified) and the ASM Gem's code madis provided the link for. Running the three in loops of 0FFFFFFFh they only vary by an average of 15ms. With no consistant winner. | |||
|  21 Mar 2006, 03:38 | 
 | 
| Ivan2k2 21 Mar 2006, 03:47 2 r22: Are you sure that my algo produces wrong results ???
 i checked it under debugger many times... 0x12345678 -> 0x1e6a2c48 0x9abcdef0 -> 0x0f7b3d59 0x1e6a2c48 -> 0x12345678 | |||
|  21 Mar 2006, 03:47 | 
 | 
| r22 21 Mar 2006, 04:45 It doesn't produce the wrong results it just doesn't run any faster in the benchmark.
 The fastest solution for 32bit is still the 16bit LookUpTable Code: mov eax,12345678h mov ebx,RevLUT movzx ecx,ax shr eax,16 movzx edx,word[ebx+ecx*2] movzx eax,word[ebx+eax*2] shl edx,16 or eax,edx Runs 30% faster than any of the other algorithms. Timings 0FFFFFFFh iterations LUT Version 937ms ThreeFastestNonLUT 1343-1360ms The LUT costs the most memory 65536 words worth, but if top speed is what you're after it can't be beat for 32bit (It'll scale to 64bit as well as the 1byte LUT did in 32bit). I wonder if there's an obscure floating point solution, like mov eax,12345678h fld eax fmul [SomeMagicNumber] fstp [esp-8] mov eax,[esp-5?] | |||
|  21 Mar 2006, 04:45 | 
 | 
| Madis731 21 Mar 2006, 08:12 There might be - I'm already studying it. I saw some logic in the:
 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F => 0,8,4,C,2,A,6,E,1,9,5,D,3,B,7,F Its like this: 1st, 9th, 5th, 13th, 3rd, 11th, 7th, 15th, 2nd, 10th, 6th, 14th, 4th, 12th, 8th, 16th. 1 +8+12+8+6+8+12+8+6+8+12+8+6+etc. taking N%16 of the result You can get something like: Xn=1+N/4*34+(8 if N && 1)+(20 if N && 2) I think they can be done with conditional moves if both bit 1 and 2 are up, then it combines to 3 and 28 is added. Now I'm gonna go and assemble this   PS. above 8 you need to subtract 3 from the answer | |||
|  21 Mar 2006, 08:12 | 
 | 
| Ivan2k2 21 Mar 2006, 10:39 hmm...
 2 Madis: i'll try your idea too.... =))))) 2 r22: maybe my code runs faster on intel p4 ??? | |||
|  21 Mar 2006, 10:39 | 
 | 
| RedGhost 21 Mar 2006, 12:03 Madis731 wrote: The 64-bit ones should be with 3.066Ghz and CPUz should say anything between 2.9-3.1 so the 10MHz is not anything to worry about. The also should have EIST that sometimes cuts off 200MHz (one ABit MoBo that I tested) on idle and maybe even more like around 1GHz (an example from laptops). ah didnt think EM64T would be listed nope, single-core, just with HT i remember my old p!!! 733mhz, and when i got it was the best out at the time and all my friends were jealous haha, i probably couldn't sell it for $20 now >_> _________________ redghost.ca | |||
|  21 Mar 2006, 12:03 | 
 | 
| Ivan2k2 22 Mar 2006, 09:23 challenge continues...  =)
 next code a little faster than any last algos without LUT (on intel p4 celeron 1.8 ) what about amd ??? Code: mov eax,0x1e6a2c48 mov ecx,eax and eax,0x33333333 and ecx,0xcccccccc mov ebx,eax mov edx,ecx lea eax,[eax*4+ebx] and eax,0x66666666 shr ecx,2 add ecx,edx and ecx,0x66666666 shr ecx,1 lea eax,[eax*2+ecx] bswap eax mov ebx,eax shr eax,4 shl ebx,4 and eax,0x0f0f0f0f and ebx,0xf0f0f0f0 add eax,ebx | |||
|  22 Mar 2006, 09:23 | 
 | 
| r22 22 Mar 2006, 18:18 It's 25% slower than your previous algorithm Ivan, on my AMD box.
 This is very odd. The timings are ~1343 ms for the non LUT algos 1734 ms for your newest one. I suspect this block of code is the cause shr ecx,2 add ecx,edx and ecx,0x66666666 shr ecx,1 Four instructions writing/modifying the same register in a row is causing a partial stale bottleneck. For some reason the Pentium optimizes around this bottle neck while the AMD processor seems to choke on it. This really makes you think about the validity of instruction level optimizations when the code can run on two totally different performing processors. | |||
|  22 Mar 2006, 18:18 | 
 | 
| Goto page 1, 2  Next < Last Thread | Next Thread > | 
| Forum Rules: 
 | 
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.