flat assembler
Message board for the users of flat assembler.
  
|  Index
      > Main > Reverse bit order of a single byte Goto page Previous 1, 2 | 
| Author | 
 | 
| ATV 29 Mar 2005, 11:42 More 32bit algos in http://www.df.lth.se/~john_e/gems/gem0018.html | |||
|  29 Mar 2005, 11:42 | 
 | 
| tom tobias 29 Mar 2005, 15:09 Thanks ATV for the two excellent references.  WHY is this procedure (32 bit reversal) useful?  When would you use it?  How is it superior to USING A LOT OF MEMORY?  Does someone imagine that bit reversal is somehow more elegant than employing a large quantity of memory to accomplish the same purpose?
 Isn't this thinking, i.e. the idea of finding a slick method to perform bit reversal efficiently, a RELIC of bygone days when memory was so precious? If MCD wishes to study computer engineering, he/she will profit from recognizing the need to FIRST DEFINE what one seeks to accomplish in writing a program, THEN explore various solutions, whether elegant or mundane. The real key to good computer engineering is to write PROGRAMS that are readily understood by anyone. Obscure, convoluted, arcane thinking, leading to a FAST, REALLY FAST, TERRIFICALLY FAST piece of CODE, that no one else can understand, is NOT ENGINEERING. It is babble. cheers  | |||
|  29 Mar 2005, 15:09 | 
 | 
| f0dder 29 Mar 2005, 15:41 Quote: 
 This is why you document your source code. *especially* when you use highly optimized assembly. Avoiding optimizations just to have "readable code" is nonsense. | |||
|  29 Mar 2005, 15:41 | 
 | 
| YONG 29 Mar 2005, 16:48 Well, ignoring the issue of speed, the above ideas are all
 very interesting. Here is my idea: Code: ; al = the input byte ; destroy ah, dl, dh mov ah, al mov dx, ax ; al = ah = dl = dh = the input byte and ax, 2211h and dx, 8844h ; al = 000?000?b ; ah = 00?000?0b ; dl = 0?000?00b ; dh = ?000?000b ror al, 1 ror ah, 3 rol dl, 3 rol dh, 1 ; al = ?000?000b ; ah = 0?000?00b ; dl = 00?000?0b ; dh = 000?000?b or ax, dx or al, ah ; al = the output byte Thank ATV, revolution, Madis731, MCD, tom tobias and f0dder for the input. YONG | |||
|  29 Mar 2005, 16:48 | 
 | 
| Madis731 29 Mar 2005, 18:54 It was an interesting topic and I looked at the gems - I loved the idea of using the same 256byte array 4 times on a DWORD    YONG Your idea is very simple but it takes 55 clocks though 32bit one takes 21 clocks and you can "and" your way to 8bits in one clock so you'll get 22 clocks  try that would you | |||
|  29 Mar 2005, 18:54 | 
 | 
| MCD 30 Mar 2005, 12:07 tom tobias wrote: WHY is this procedure (32 bit reversal) useful? When would you use it? Indeed I can't think of so many uses for this, but assume you have a serial (input) device which sends each single bits to a controler which merges them into a byte which then may be used further by a PC driver software. Imagine the controller puts them into the reversed order, than you will have to rereverse them in the driver code. Unfortunately, it seems that similar kind of throwing bits widely across hardware I/O ports is widely common. Well, you have logical and shift instructions to do this, but a shorter, easier and faster method which does the job in one step would be nice. _________________ MCD - the inevitable return of the Mad Computer Doggy -||__/ .|+-~ .|| || | |||
|  30 Mar 2005, 12:07 | 
 | 
| YONG 30 Mar 2005, 12:33 Here are the 16-bit & the 32-bit versions of my previous idea:
 Code: ; ; 8-bit input - 10 instructions - shown earlier ; ; ; 16-bit input - 19 instructions ; ; ax = the input word ; destroy bx, cx, dx mov bx, ax mov cx, ax mov dx, ax ; ax = bx = cx = dx = the input word and ax, 1111h and bx, 2222h and cx, 4444h and dx, 8888h ror al, 1 ror ah, 1 ror bl, 3 ror bh, 3 rol cl, 3 rol ch, 3 rol dl, 1 rol dh, 1 or ax, dx or ax, cx or ax, bx ; Still requires to swap al and ah xchg al, ah ; ax = the output word ; ; 32-bit input - 30 instructions ; ; eax = the input dword ; destroy ebx, ecx, edx mov ebx, eax mov ecx, eax mov edx, eax ; eax = ebx = ecx = edx ; = the input dword and eax, 11111111h and ebx, 22222222h and ecx, 44444444h and edx, 88888888h ror al, 1 ror ah, 1 bswap eax ror al, 1 ror ah, 1 ror bl, 3 ror bh, 3 bswap ebx ror bl, 3 ror bh, 3 rol cl, 3 rol ch, 3 bswap ecx rol cl, 3 rol ch, 3 rol dl, 1 rol dh, 1 bswap edx rol dl, 1 rol dh, 1 or eax, edx or eax, ecx or eax, ebx ; eax = the output dword Well, one use of bit reversal, whether it is 8-bit, 16-bit or 32-bit, is in encryption/decryption algorithms, which acts as a simple, reversible byte/word/dword substitution mechanism. Thank Madis731 & MCD for the input. YONG | |||
|  30 Mar 2005, 12:33 | 
 | 
| Matrix 30 Mar 2005, 19:02 hello,
 YONG whould you test this? Code: bitswap32: mov cx,32 .l666: shr eax,1 rcl ebx,1 loopw .l666 mov eax,ebx ret i'm not sure in that bswap thing is faster ... | |||
|  30 Mar 2005, 19:02 | 
 | 
| YONG 31 Mar 2005, 09:45 Well, this is like a beauty contest and the contestants are:
 Loop1: Matrix's idea / idea from the win32asm forum Loop2: Madis731's idea Loop3: My stupid idea Loop4: Table-lookup approach - Within this approach, there are many legs: - with xlatb - without xlatb - align 256 - no alignment - Let's use the simplest way: without xlatb, no alignment Assumptions: - the input dword: eax - free to use ebx, ecx & edx, i.e., no need to push/pop - the output dword: one of eax, ebx, ecx & edx Code: ; ; 29 bytes ; mov di, buffer Loop1: mov eax, dword [di] mov cx, 32 innerLp: shr eax, 1 rcl ebx, 1 loop innerLp mov dword [di], ebx add di, 4 cmp di, buffer+bufferSize jb Loop1 bufferSize = 64 buffer rb bufferSize ; ; 110 bytes ; - Size can be reduced by making routine calls ; mov di, buffer Loop2: mov eax, dword [di] mov dx, ax and ax, 0aaaah ; ax = ?0?0?0?0?0?0?0?0 and dx, 05555h ; dx = 0?0?0?0?0?0?0?0? ror ax, 2 or ax, dx mov dx, ax and ax, 06666h ; ax = 0??00??00??00??0 and dx, 09999h ; dx = ?00??00??00??00? ror ax, 4 or ax, dx mov dx, ax and ax, 01e1eh ; ax = 000????0000????0 and dx, 0e1e1h ; dx = ???0000????0000? ror ax, 8 or ax, dx ror ax, 1 ror eax, 16 mov dx, ax and ax, 0aaaah ; ax = ?0?0?0?0?0?0?0?0 and dx, 05555h ; dx = 0?0?0?0?0?0?0?0? ror ax, 2 or ax, dx mov dx, ax and ax, 06666h ; ax = 0??00??00??00??0 and dx, 09999h ; dx = ?00??00??00??00? ror ax, 4 or ax, dx mov dx, ax and ax, 01e1eh ; ax = 000????0000????0 and dx, 0e1e1h ; dx = ???0000????0000? ror ax, 8 or ax, dx ror ax, 1 ; ror eax, 16 ; <= bugfix by YONG mov dword [di], eax add di, 4 cmp di, buffer+bufferSize jb Loop2 bufferSize = 64 buffer rb bufferSize ; ; 115 bytes ; mov di, buffer Loop3: mov eax, dword [di] ; eax = the input dword mov ebx, eax mov ecx, eax mov edx, eax ; eax = ebx = ecx = edx ; = the input dword and eax, 11111111h and ebx, 22222222h and ecx, 44444444h and edx, 88888888h ror al, 1 ror ah, 1 bswap eax ror al, 1 ror ah, 1 ror bl, 3 ror bh, 3 bswap ebx ror bl, 3 ror bh, 3 rol cl, 3 rol ch, 3 bswap ecx rol cl, 3 rol ch, 3 rol dl, 1 rol dh, 1 bswap edx rol dl, 1 rol dh, 1 or eax, edx or eax, ecx or eax, ebx ; eax = the output dword mov dword [di], eax add di, 4 cmp di, buffer+bufferSize jb Loop3 bufferSize = 64 buffer rb bufferSize ; ; 45 bytes (+ 256 bytes for the lookup table) ; xor bh, bh mov di, buffer Loop4: mov eax, dword [di] mov bl, al mov al, [lookup+bx] mov bl, ah mov ah, [lookup+bx] bswap eax mov bl, ah mov ah, [lookup+bx] mov bl, al mov al, [lookup+bx] mov dword [di], eax add di, 4 cmp di, buffer+bufferSize jb Loop4 lookup db 00h, ..., 0ffh ; 256 bytes bufferSize = 64 buffer rb bufferSize Average time taken to finish 9.68MB of data: Loop1: 31 seconds =(30+32)/2 ; one encryption + one decryption Loop2: 18.5 seconds =(18+19)/2 Loop3: 18.5 seconds =(18+19)/2 Loop4: 15.5 seconds =(15+16)/2 To summarize: - the fastest: Loop4 - however, it is also the biggest (due to the lookup table) - the shortest and the most flexible: Loop1 - however, it is rather slow - the other ideas are somewhere in the middle of the spectrum Thank Matrix for the tough challenge. No more testing!!! YONG | |||
|  31 Mar 2005, 09:45 | 
 | 
| revolution 01 Apr 2005, 01:47 But you didn't fully extend the 16 bit code to 32 bit.
 Code: bitswap_eax: mov edx,eax and eax,0aaaaaaaah and edx,055555555h ror eax,2 or eax,edx mov edx,eax and eax,066666666h and edx,099999999h ror eax,4 or eax,edx mov edx,eax and eax,01e1e1e1eh and edx,0e1e1e1e1h ror eax,8 or eax,edx mov edx,eax and eax,001fe01feh and edx,0fe01fe01h ror eax,16 or eax,edx ror eax,1 ret | |||
|  01 Apr 2005, 01:47 | 
 | 
| YONG 01 Apr 2005, 09:07 Loop2a: revolution's suggestion
 Code: ; ; 113 bytes ; mov di, buffer Loop2a: mov eax, dword [di] mov edx, eax and eax, 0aaaaaaaah and edx, 055555555h ror eax, 2 or eax, edx mov edx, eax and eax, 066666666h and edx, 099999999h ror eax, 4 or eax, edx mov edx, eax and eax, 01e1e1e1eh and edx, 0e1e1e1e1h ror eax, 8 or eax, edx mov edx, eax and eax, 001fe01feh and edx, 0fe01fe01h ror eax, 16 or eax, edx ror eax, 1 mov dword [di], eax add di, 4 cmp di, buffer+bufferSize jb Loop2a bufferSize = 64 buffer rb bufferSize Average time taken - Loop2a: 17.5 seconds = ( 17 + 18 ) / 2 - faster (but also slightly bigger) than Loop2 Thank revolution for the input. YONG | |||
|  01 Apr 2005, 09:07 | 
 | 
| revolution 04 Apr 2005, 07:48 Just for fun:
 MMX version Code: align 8 c5555 dq 05555555555555555h c3333 dq 03333333333333333h c0f0f dq 00f0f0f0f0f0f0f0fh c00ff dq 000ff00ff00ff00ffh bitswap_mm0: movq mm1,mm0 psrlq mm0,1 pand mm0,[c5555] pand mm1,[c5555] psllq mm1,1 por mm0,mm1 movq mm1,mm0 psrlq mm0,2 pand mm0,[c3333] pand mm1,[c3333] psllq mm1,2 por mm0,mm1 movq mm1,mm0 psrlq mm0,4 pand mm0,[c0f0f] pand mm1,[c0f0f] psllq mm1,4 por mm0,mm1 movq mm1,mm0 psrlq mm0,8 pand mm0,[c00ff] pand mm1,[c00ff] psllq mm1,8 por mm0,mm1 pshufw mm0,mm0,0*64+1*16+2*4+3 ret XMM version Code: align 16 d5555 dq 05555555555555555h,05555555555555555h d3333 dq 03333333333333333h,03333333333333333h d0f0f dq 00f0f0f0f0f0f0f0fh,00f0f0f0f0f0f0f0fh d00ff dq 000ff00ff00ff00ffh,000ff00ff00ff00ffh bitswap_xmm0: movdqa xmm1,xmm0 psrlq xmm0,1 pand xmm0,dqword[d5555] pand xmm1,dqword[d5555] psllq xmm1,1 por xmm0,xmm1 movdqa xmm1,xmm0 psrlq xmm0,2 pand xmm0,dqword[d3333] pand xmm1,dqword[d3333] psllq xmm1,2 por xmm0,xmm1 movdqa xmm1,xmm0 psrlq xmm0,4 pand xmm0,dqword[d0f0f] pand xmm1,dqword[d0f0f] psllq xmm1,4 por xmm0,xmm1 movdqa xmm1,xmm0 psrlq xmm0,8 pand xmm0,dqword[d00ff] pand xmm1,dqword[d00ff] psllq xmm1,8 por xmm0,xmm1 pshuflw xmm0,xmm0,0*64+1*16+2*4+3 pshufhw xmm0,xmm0,0*64+1*16+2*4+3 pshufd xmm0,xmm0,1*64+0*16+3*4+2 ret | |||
|  04 Apr 2005, 07:48 | 
 | 
| MCD 04 Apr 2005, 12:26 wicked!   | |||
|  04 Apr 2005, 12:26 | 
 | 
| Goto page  Previous  1, 2 < Last Thread | Next Thread > | 
| Forum Rules: 
 | 
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.