flat assembler
Message board for the users of flat assembler.
![]() Goto page 1, 2 Next |
Author |
|
ronware
The fastest is to simply have a lookup table with the correct values. If you can spare 256 bytes, this is hard to beat.
|
|||
![]() |
|
YONG
The lookup table approach is a bit faster.
Thank ronware for the input. YONG |
|||
![]() |
|
rea
Search the win32asm forum there is one interesting in the algorithm section
![]() |
|||
![]() |
|
r22
Code: Algorithm (SLOW) ;AL = 11101100b MOV ECX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 SHRD DX,AX,CL ROL DX,1 SHR AX,1 MOV AL,DL ;AL = 00110111b LookUp Table (FASTEST) label: db 00000000b, 10000000b, 01000000b ... etc 256bytes MOVZX EAX,AL MOV AL, BYTE [LABEL+EAX] |
|||
![]() |
|
Matrix
Hello
Code: bitswap: ; swap bits in al mov ah,al xor al,al mov cx,8 @l0: shl al,1 shr ah,1 adc al,0 loop @l0 ret bitswap: ; swap bits in al mov cx,8 @l0: shl al,1 rcr ah,1 loop @l0 mov al,ah ret Techno Forever !!!! regards ... |
|||
![]() |
|
YONG
Idea from the win32asm forum:
Loop 8 times - Shift right (shr) the input byte by 1 bit => Carry, and then - Rotate through carry left (rcl) the output byte by 1 bit. This approach is very flexible since it can easily deal with 16-bit or 32-bit input. However, I don't think this would be very fast, because there are two bit-shifting operations per loop and eight loops are required for a single byte. Thank rea, r22 and Matrix for the input. YONG |
|||
![]() |
|
MCD
YONG wrote: Idea from the win32asm forum: Code: label: db 00000000b, 10000000b, 01000000b; ... etc 256bytes reverse_word_ax: push edx ecx movzx edx,al mov ch,[label+edx] mov dl,ah mov cl,[label+edx] mov ax,cx pop ecx edx ret reverse_dword_eax: push edx ecx movzx edx,al mov ch,[label+edx] mov dl,ah mov cl,[label+edx] shr eax,16 shl ecx,16 mov dl,al mov ch,[label+edx] mov dl,ah mov cl,[label+edx] mov eax,ecx pop ecx edx ret _________________ MCD - the inevitable return of the Mad Computer Doggy -||__/ .|+-~ .|| || Last edited by MCD on 29 Mar 2005, 13:14; edited 1 time in total |
|||
![]() |
|
r22
The only worth while way seems to be the lookup table.
I wish there were MMX/XMMX instruction where you could do multiple lookups at the same time that would be awsum. MM0 = 8 packed bytes PLTB MM0,ESI <= something like that and return the result bytes in mm0 ;MM0 = 8 packed return bytes from byte[ESI+Byte] It's a shame processor engineers don't think like me ![]() |
|||
![]() |
|
YONG
It seems that the lookup table approach is the best if
there is enough room for the table. Here is the lookup table for those who may also want to use this approach. Thank MCD and r22 for the input. YONG
|
|||||||||||
![]() |
|
Octavio
this should be shorter and faster.
Code: reverse_dword_eax: push ebx mov ebx,table xlat mov bl,ah mov ah,[ebx] bswap eax xlat mov bl,ah mov ah,[ebx] pop ebx ret aling 256 table: db 0,10000000b ...... |
|||
![]() |
|
YONG
Just want to point out two tiny bugs.
Code by MCD: reverse_dword_eax: push eax edx ecx ... mov eax,ecx pop ecx edx eax ret Since eax is used as an input/output variable, pushing it at the beginning is unnecessary and popping it at the end replaces its output value by its input value. Code by Octavio: reverse_dword_eax: ... xlat ... xlat ... ret Since xlat requires a "dummy" operand, use xlatb instead, which requires no operand. Thank Octavio for the input. YONG |
|||
![]() |
|
Octavio
There was a bug ,xlatb can not be used because bl is modified.
Code: reverse_dword_eax: push ebx mov ebx,table mov bl,al mov al,[ebx] mov bl,ah mov ah,[ebx] bswap eax mov bl,al mov al,[ebx] mov bl,ah mov ah,[ebx] pop ebx ret aling 256 table: db 0,10000000b ...... |
|||
![]() |
|
r22
XLAT like the string instructions is SLOW just use
mov ebx, Table movzx eax,al ;the byte mov al, [ebx+eax] if you have an array of bytes that need reversing Code: xor EAX,EAX mov ESI, Array mov EDI, RevArray mov EBX, Table mov ECX, [Length] sub ECX,1 .theLoop: JS .allDone ;when ECX is negative mov AL, byte[ESI+ECX] mov AL, byte[EBX+EAX] mov byte[EDI+ECX],AL sub ECX,1 jmp .theLoop .allDone: |
|||
![]() |
|
YONG
I tested the following two loops, one using xlatb
and the other without xlatb. The two virtually took the same amount of time to finish 1.8MB of data. The xlatb loop is even a bit shorter. Code: ; ; 20 bytes xor bh, bh mov di, buffer Loop1: mov bl, [di] mov al, [lookup+bx] mov [di], al inc di cmp di, buffer+bufferSize jb Loop1 ; ; 18 bytes mov bx, lookup mov di, buffer Loop2: mov al, [di] xlatb mov [di], al inc di cmp di, buffer+bufferSize jb Loop2 ; lookup db 00h, ..., 0ffh ; 256 bytes bufferSize = 128 buffer rb bufferSize Thank Octavio and r22 for the input. YONG |
|||
![]() |
|
Madis731
Isn't there a mathematical solution for that - something shorter than look-up and faster than the solutions offered here.
I mean logical explanation is very short for this algorithm: FOR x=0 to y xor bit(x),bit(y-x) xor bit(y-x),bit(x) xor bit(x),bit(y-x) NEXT We KNOW the answer to any input without even calculating - it's not like adding with carry. There must be an easier solution. |
|||
![]() |
|
ATV
Code: bitswap_al: ; swap bits in al (destroy: ah) mov ah,al and ax,55aah ; ah = 0?0?0?0? , al = ?0?0?0?0 ror al,2 or al,ah mov ah,al and ax,9966h ; ah = ?00??00? , al = 0??00??0 rol al,4 or al,ah ror al,1 ret |
|||
![]() |
|
revolution
ATV: Nice routine. 8 bits in 9 instructions.
Can this be extended to 16 and 32 bits also? |
|||
![]() |
|
ATV
32bit algo found in http://aggregate.org/MAGIC/
Code: bitswap_ax: ; swap bits in ax push dx mov dx,ax and ax,0aaaah ; ax = ?0?0?0?0?0?0?0?0 and dx,05555h ; dx = 0?0?0?0?0?0?0?0? ror ax,2 or ax,dx mov dx,ax and ax,06666h ; ax = 0??00??00??00??0 and dx,09999h ; dx = ?00??00??00??00? ror ax,4 or ax,dx mov dx,ax and ax,01e1eh ; ax = 000????0000????0 and dx,0e1e1h ; dx = ???0000????0000? ror ax,8 or ax,dx ror ax,1 pop dx ret |
|||
![]() |
|
Madis731
Hi,
I'm gonna try something: Code: ;Bitreversal in a DWORD 90.8clk average of 104857600 passes ;The code is too long to be called optimized - N E 1 wanna try to make ;it faster? mov eax,12345678h mov ebx,eax and ebx,055555555h and eax,0AAAAAAAAh ror al,2 rol eax,8 ror al,2 rol eax,8 ror al,2 rol eax,8 ror al,2 rol eax,8 or eax,ebx mov ebx,eax and ebx,099999999h and eax,066666666h rol al,4 rol eax,8 rol al,4 rol eax,8 rol al,4 rol eax,8 rol al,4 rol eax,8 or eax,ebx ror al,1 rol eax,8 ror al,1 rol eax,8 ror al,1 rol eax,8 ror al,1 rol eax,8 bswap eax EDIT: Hehee You were faster ![]() Now try this: Code: ;32-bit try on two 16-bit passes ;59.0clk on average 10485760 passes ; |
|||
![]() |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
Website powered by rwasa.