flat assembler
Message board for the users of flat assembler.
Index
> Main > Fastest base64 encoding (null terminated strings) Goto page 1, 2 Next |
Author |
|
decard 12 Feb 2005, 23:52
Hi r22,
sometime ago I was working on something similar. Here's my function if you'd like to take a look. It is not optimised, but you may want to see different implementation of similar algoorithm. I see some similarities with my code, but I don't have time to compare them (at least not now, I'm too sleepy ). base64_table is just the same as yours "base64". Code: ; encodes given buffer into base64 stream proc base64_encode, .source,.src_size,.out_buf begin push ecx edx esi edi mov esi,[.source] mov edi,[.out_buf] mov ecx,[.src_size] .loop: mov edx,[esi] cmp ecx,3 jae .remainder_ok and edx,0xffff cmp ecx,2 jae .remainder_ok and edx,0xff .remainder_ok: bswap edx mov eax,edx shr eax,26 and eax,111111b mov al,[base64_table+eax] mov [edi],al inc edi mov eax,edx shr eax,20 and eax,111111b mov al,[base64_table+eax] mov [edi],al inc edi dec ecx jz .r2 mov eax,edx shr eax,14 and eax,111111b mov al,[base64_table+eax] mov [edi],al inc edi dec ecx jz .r1 mov eax,edx shr eax,8 and eax,111111b mov al,[base64_table+eax] mov [edi],al inc edi add esi,3 dec ecx jnz .loop jmp .finish .r2: mov byte[edi],'=' inc edi .r1: mov byte[edi],'=' inc edi .finish: mov byte[edi],0 sub edi,[.out_buf] mov eax,edi ; return output size in eax pop edi esi edx ecx return endp you can download whole encoding program from here: http://decard.net/download/base64.zip |
|||
12 Feb 2005, 23:52 |
|
Reverend 13 Feb 2005, 14:28
Heh. I also wrote base64 routine one day. I was curious wihich one is the fastest one, so I copied the three routines (r22's, decard's and mine) in one file and loop every call 256 times encoding the same string. I must admit that my implementation was the worst one Decard has a second place, and you r22 made really an impresing job, because yours was much faster. Congratulations
Btw. My implementation is on my site to download EDIT: I forgot to show the results, shit, here it is: r22: 5415h decard: 207BFh reverend: F43BBh () This number is the difference betwwen cycles before and after loop. I chked it using rdtsc instruction |
|||
13 Feb 2005, 14:28 |
|
Madis731 14 Feb 2005, 10:59
Though, you have to love decard's one - its the cleanest
and you can also do more with a proc than just a call, right? |
|||
14 Feb 2005, 10:59 |
|
Matrix 14 Feb 2005, 17:12
hello
quote = r22:"_I am pretty sure this is the fastest implementation possible, but I would love to be proved wrong." i whouldn't say that, i was just having problems running your testfile, it did nothing. :/ say, if you use 0 terminated strings, you cannot encode 0 bytes, whouldn't it be nicer to use a sized buffer? |
|||
14 Feb 2005, 17:12 |
|
Matrix 14 Feb 2005, 18:21
whould you test this for me?
Code: format PE GUI entry Main section '.data' data readable writeable base64 db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0 teststr db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0 fmt db '%lu',0 txtbuf rb 40 retbuf rb 0FFh section '.code' code readable executable Main: push ebp mov ebp,esp call [GetTickCount] push eax mov ebx,0FFFFFh;FFh ;# of times to call function Bench: dec ebx test ebx,ebx jz Ending push retbuf push teststr call Encode64 jmp Bench Ending: call [GetTickCount] pop ecx sub eax,ecx ;time it takes for all calls push eax push fmt push txtbuf call [wsprintf] push 0 push txtbuf push txtbuf push 0 call [MessageBox] push dword 0 push dword retbuf push dword retbuf push dword 0 call [MessageBox] mov esp,ebp pop ebp retn 0 Encode64: .retStr equ ebp+12 .theStr equ ebp+8 push ebp mov ebp,esp push esi push edi push edx push ebx mov esi, [.theStr] mov edi, [.retStr] .transfer: mov ecx,[esi] mov eax,ecx bswap ecx shr ecx,8 ;get bits aligned for b64 expand add esi,3 lea edx,[eax-$00010101] xor eax,-1 and edx,eax and edx,$808080 jnz .endloop mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+3],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+2],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+1],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi],al add edi,4 jmp .transfer .endloop: shr edx,8 jc .finished shr edx,8 jc .onebyte xor edx,edx shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov ah,'=' ;.twobyte: mov [edi+2],ax shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+1],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+0],al jmp .finished .onebyte: xor edx,edx shr ecx,12 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+1],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi],al mov word [edi+2],'==' .finished: pop ebx pop edx pop edi pop esi mov esp,ebp pop ebp retn 8 section '.idata' import data readable writeable dd 0,0,0,RVA lpszKernel32, RVA lpszKernel32Table dd 0,0,0,RVA lpszUser32, RVA lpszUser32Table dd 0,0,0,0,0 lpszKernel32Table: GetTickCount dd RVA lpszGetTickCount dd 0 lpszUser32Table: MessageBox dd RVA lpszMessageBoxA wsprintf dd RVA lpszwsprintfA dd 0 lpszKernel32 db 'KERNEL32.DLL',0 lpszUser32 db 'USER32.DLL',0 lpszGetTickCount db 0,0,'GetTickCount',0 lpszMessageBoxA db 0,0,'MessageBoxA',0 lpszwsprintfA db 0,0,'wsprintfA',0 fixed Last edited by Matrix on 15 Feb 2005, 17:45; edited 1 time in total |
|||
14 Feb 2005, 18:21 |
|
r22 15 Feb 2005, 02:15
Matrix, if you copy and paste my code into fasmw it should compile correctly and the reason you saw nothing was because it was running a benchmark testing 0x0FFFFFFF function calls to the same function it then (after 13.3seconds on my computer) outputs the results (how many ms it took) to a message box.
I've tested your function, I had to change the benchmark counter from ebx to edi because your function affects ebx (for xlatb), however the results were not base 64 encoded infact characters that weren't even in the base64 db appeared in the results. Also running the function 0x0FFFFFFF times its speed was 100ms slower than mine. So fixing the bug and with a few optimizations it maybe faster. |
|||
15 Feb 2005, 02:15 |
|
Matrix 15 Feb 2005, 14:24
the base64 address does not go into EBX :/
howcome yours is faster? xlat is that slow? |
|||
15 Feb 2005, 14:24 |
|
Octavio 15 Feb 2005, 16:07
Matrix wrote: the base64 address does not go into EBX :/ those modern processors, who knows? i think that 'rol' is slower than 'shl' and both are slow instructions. also the mix of 8bits and 32 bits instructions can cause agi stalls You are encoding in reverse order you need to add a 'bswap' I think that the code of 'rr' can be optimized but it don´t needs to be optimized. |
|||
15 Feb 2005, 16:07 |
|
Matrix 15 Feb 2005, 17:42
i could give an improvement of 20% with just a few modifications,
at least another 25 % could be possible in case using all four bytes read, and adding 4 to esi Code: format PE GUI entry Main section '.data' data readable writeable base64 db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0 ;teststr db '1',0 teststr db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0 fmt db '%lu',0 txtbuf rb 40 retbuf rb 0FFh section '.code' code readable executable Main: push ebp mov ebp,esp call [GetTickCount] push eax mov ebx,0FFFFFh;FFh ;# of times to call function Bench: dec ebx test ebx,ebx jz Ending push retbuf push teststr call Encode64 jmp Bench Ending: call [GetTickCount] pop ecx sub eax,ecx ;time it takes for all calls push eax push fmt push txtbuf call [wsprintf] push 0 push txtbuf push txtbuf push 0 call [MessageBox] push dword 0 push dword retbuf push dword retbuf push dword 0 call [MessageBox] mov esp,ebp pop ebp retn 0 Encode64: .retStr equ ebp+12 .theStr equ ebp+8 push ebp mov ebp,esp push esi push edi push edx push ebx mov esi, [.theStr] mov edi, [.retStr] xor edx,edx .transfer: mov ecx,[esi] ; or cl,cl ; jz .finished ; or ch,ch ; jz .onebyte ; bswap ecx ; shr ecx,8 ;get bits aligned for b64 expand ; or cl,cl ; jz .twobyte bswap ecx shr ecx,8 ;get bits aligned for b64 expand cmp byte[esi],0 ;null end func je .finished cmp byte[esi+1],0 ;1byte to encode then end je .onebyte cmp byte[esi+2],0 ;2byte to encode then end je .twobyte add esi,3 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+3],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+2],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+1],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi],al add edi,4 jmp .transfer .twobyte: shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov ah,'=' mov [edi+2],ax shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+1],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+0],al jmp .finished .onebyte: ; bswap ecx ; xor edx,edx ; shr ecx,20 shr ecx,12 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi+1],al shr ecx,6 mov dl,cl and dl,00111111b mov al,[base64+edx] mov [edi],al mov word [edi+2],'==' .finished: pop ebx pop edx pop edi pop esi mov esp,ebp pop ebp retn 8 section '.idata' import data readable writeable dd 0,0,0,RVA lpszKernel32, RVA lpszKernel32Table dd 0,0,0,RVA lpszUser32, RVA lpszUser32Table dd 0,0,0,0,0 lpszKernel32Table: GetTickCount dd RVA lpszGetTickCount dd 0 lpszUser32Table: MessageBox dd RVA lpszMessageBoxA wsprintf dd RVA lpszwsprintfA dd 0 lpszKernel32 db 'KERNEL32.DLL',0 lpszUser32 db 'USER32.DLL',0 lpszGetTickCount db 0,0,'GetTickCount',0 lpszMessageBoxA db 0,0,'MessageBoxA',0 lpszwsprintfA db 0,0,'wsprintfA',0 but this is insane, look at this: when using this code Code: bswap ecx shr ecx,8 ;get bits aligned for b64 expand cmp byte[esi],0 ;null end func je .finished xor edx,edx cmp byte[esi+1],0 ;1byte to encode then end je .onebyte cmp byte[esi+2],0 ;2byte to encode then end je .twobyte then is 3 times faster then Code: or cl,cl jz .finished or ch,ch jz .onebyte bswap ecx shr ecx,8 ;get bits aligned for b64 expand or cl,cl jz .twobyte or the one that is branch optimised |
|||
15 Feb 2005, 17:42 |
|
r22 15 Feb 2005, 23:22
You'd think the or cl,cl would be faster but on a Pentium 4 the old OR optimizations don't work. Also the alignment of the code and how it hit the processor play a factor. I actually switch around instructions in my code to get it to hit the processor in the fastest way possible.
What I'm testing on 3.2ghz Pentium 4 1024mb ddr ram I'm trying to unroll it right now, but so far it seems to slow the algorithm down. |
|||
15 Feb 2005, 23:22 |
|
Matrix 15 Feb 2005, 23:42
but you see what was i thinking of,
you were comparing memory with 0, while i was using registers, howcome yours is much faster possible exposition: computer detects sequential reading and reads data in cache, so read penalties won't occour ? this gives an increment of 300% in performance? in this case you may want to prefer using Prefetchnta too or reading cache lines 32 or 64 bytes in depth |
|||
15 Feb 2005, 23:42 |
|
Nicodemus 14 Feb 2008, 18:37
Hi there,
you might be interested in the following base64-encoding algorithm, which can process up to ~700 MB/s of input on my P4 D @3400 MHz.
Notes:
Code: ; Nicodemus' base64 encoding algorithm format MS COFF section '.text' code readable executable ; base64 encoding (up to >700 MB/s on input side on P4 D 3400 MHz) ; -------- optimized to encode large amounts of data -------- ; ; 1. Compile with fasm (produces obj file) ; 2. Link this obj file to your app ; 3. declare the following: ; extern "C" { void ToBase64(const unsigned char* const pbSource, const unsigned int uSourceLen, char *pOutBuf); } ; 4. Call the function public _ToBase64 _ToBase64: ; ----------- cdecl calling convention support push ebx push esi push edi mov esi,[esp+16] ; pbSource mov ecx,[esp+20] ; uSourceLen mov edi,[esp+24] ; pOutBuf cmp ecx,13 ; 13 or more bytes? jl remaining @13: sub ecx,12 ; 12 bytes will be processed mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm0,ebx mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm1,ebx psllq mm1,32 por mm0,mm1 movq2dq xmm0,mm0 mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm0,ebx mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm1,ebx psllq mm1,32 por mm0,mm1 movq2dq xmm1,mm0 pslldq xmm1,8 por xmm0,xmm1 movntdq [edi],xmm0 ; store 16 bytes aligned add edi,16 ; 16 output bytes have been generated cmp ecx,13 ; 13 or more bytes left? jge @13 remaining: ; process remaining bytes (12 bytes or less) cmp ecx,12 ; 12 bytes? je @12 cmp ecx,7 ; 7 bytes or more? jge @7 cmp ecx,6 ; 6 bytes or more? je @6 remaining4: cmp ecx,4 ; 4 bytes or more? jge @4 cmp ecx,3 ; 3 bytes? je @3 remaining2: cmp ecx,2 ; 2 bytes? je @2 cmp ecx,1 ; 1 byte? je @1 jmp cleanup @12: mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm0,ebx mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm1,ebx psllq mm1,32 por mm0,mm1 movq2dq xmm0,mm0 mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm0,ebx ;inc esi ;mov dx,[esi] ; get 2 bytes ;shl edx,8 ;dec esi ;mov dl,[esi] ; get 1 byte ;add esi,3 mov bx,[esi] ; get 2 bytes add esi,2 mov dl,[esi] ; get 1 byte shl edx,16 mov dx,bx bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm1,ebx psllq mm1,32 por mm0,mm1 movq2dq xmm1,mm0 pslldq xmm1,8 por xmm0,xmm1 movntdq [edi],xmm0 ; store 16 bytes aligned jmp cleanup @7: ; process 6 input bytes sub ecx,6 mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm0,ebx mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm1,ebx psllq mm1,32 por mm0,mm1 movntq [edi],mm0 add edi,8 jmp remaining4 @6: ; process 6 input bytes mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm0,ebx mov bx,[esi] ; get 2 bytes add esi,2 mov dl,[esi] ; get 1 byte shl edx,16 mov dx,bx bswap edx shr edx,8 mov ebx,edx shr ebx,12 ; upper 12 bits mov ax,[base64lut+ebx*2] and edx,0x00000fff ; lower 12 bits mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movd mm1,ebx psllq mm1,32 por mm0,mm1 movntq [edi],mm0 jmp cleanup @4: ; process three input bytes sub ecx,3 mov edx,[esi] add esi,3 bswap edx shr edx,8 mov ebx,edx shr ebx,12 mov ax,[base64lut+ebx*2] and edx,0x00000fff mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movnti [edi],ebx add edi,4 jmp remaining2 @3: ; process three input bytes mov bx,[esi] ; get 2 bytes add esi,2 mov dl,[esi] ; get 1 byte inc esi shl edx,16 mov dx,bx bswap edx shr edx,8 mov ebx,edx shr ebx,12 mov ax,[base64lut+ebx*2] and edx,0x00000fff mov bx,[base64lut+edx*2] shl ebx,16 mov bx,ax movnti [edi],ebx jmp cleanup @2: ; process two input bytes (with padding) xor edx,edx mov dx,[esi] ; get two bytes bswap edx shr edx,8 mov ebx,edx shr ebx,12 mov ax,[base64lut+ebx*2] and edx,0x00000fff mov bx,[base64lut+edx*2] mov bh,0x3d ; '=' padding! shl ebx,16 mov bx,ax movnti [edi],ebx jmp cleanup @1: ; process one input byte (with padding) xor edx,edx mov dl,[esi] ; get one byte shl dx,4 mov eax,0x3d3d0000 ; '=' padding! mov ax,[base64lut+edx*2] movnti [edi],eax ; store cleanup: emms ; switch back to standard FPU mode pop edi pop esi pop ebx ret section '.data' data readable writeable base64lut: file 'lut.bin'
|
|||||||||||
14 Feb 2008, 18:37 |
|
Nicodemus 16 Feb 2008, 12:01
Here comes the decoder (at ~300 MB/s on my P4 D 3400 MHz).
This is a "pure" decoder variant, which accepts padding ('=' and '=='), but no "non-alphabetic" characters in the input sequence. Notes:
Hint for further optimization: This algorithm might have a potential to be accelerated by using SSE-registers for reducing memory bus traffic. Code: ; Nicodemus' base64 decoding algorithm format MS COFF section '.text' code readable executable ; base64 decoding (up to 300 MB/s on input side on P4 D 3400 MHz) ; ; 1. Compile with fasm (produces obj file) ; 2. Link this obj file to your app ; 3. declare the following: ; extern "C" { unsigned int FromBase64(const char* const pbSource, const unsigned int uSourceLen, unsigned char *pOutBuf); } ; 4. Call the function public _FromBase64 _FromBase64: ; ----------- cdecl calling convention support push ebx push esi push edi push ebp mov esi,[esp+20] ; pbSource mov ecx,[esp+24] ; uSourceLen mov edi,[esp+28] ; pOutBuf mov ebx,base64lut ; for xlatb decode4bytes: cmp ecx,4 jle final4 sub ecx,4 ; four input bytes will be processed mov eax,[esi] ; get four input bytes add esi,4 xor ebp,ebp ; clear ebp ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,26 ; shift them to "top" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,20 ; shift them to "top - 6 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,14 ; shift them to "top - 12 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,8 ; shift them to "top - 18 bits" of register or ebp,edx ; store them in output register ; three decoded output bytes have been generated, which have to be stored mov edx,ebp bswap edx mov [edi],dx shr edx,16 add edi,2 mov [edi],dl inc edi jmp decode4bytes final4: mov eax,[esi] ; get four input bytes xor ebp,ebp ; clear ebp ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,26 ; shift them to "top" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,20 ; shift them to "top - 6 bits" of register or ebp,edx ; store them in output register ; at this point, padding might occur cmp al,0x3d jz padding2 ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,14 ; shift them to "top - 12 bits" of register or ebp,edx ; store them in output register ; at this point, padding might occur cmp al,0x3d jz padding1 ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,8 ; shift them to "top - 18 bits" of register or ebp,edx ; store them in output register ; three decoded output bytes have been generated, which have to be stored mov edx,ebp bswap edx mov [edi],dx shr edx,16 add edi,2 mov [edi],dl xor eax,eax ; 0 means "no padding" jmp cleanup padding2: ; ebp contains only one valid output byte mov edx,ebp shr edx,24 mov [edi],dl mov eax,2 ; 2 means "2 bytes padding" jmp cleanup padding1: ; ebp contains only two valid output bytes mov edx,ebp bswap edx mov [edi],dx mov eax,1 ; 1 means "1 byte padding" cleanup: pop ebp pop edi pop esi pop ebx ret section '.data' data readable writeable base64lut: file 'dec.bin'
|
|||||||||||
16 Feb 2008, 12:01 |
|
Alphonso 17 Feb 2008, 14:23
Hi r22,
you could maybe try deleting all instances of Code: and dl,00111111b Code: base64 db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/' db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/' db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/' db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0 |
|||
17 Feb 2008, 14:23 |
|
Nicodemus 17 Feb 2008, 20:15
Hi there!
I optimized the decoder (see post above). This version requires a 16-byte-aligned base64 source. With this precondition, this version of the decoder provides a ~13.7 % speed increase against the previous one. Note: You will require the lookup-table-file "dec.bin" from my previous post to assemble this one. Code: ; Nicodemus' base64 decoding algorithm format MS COFF section '.text' code readable executable ; base64 decoding (up to 310 MB/s on input side on P4 D 3400 MHz) ; ; 1. Compile with fasm (produces obj file) ; 2. Link this obj file to your app ; 3. declare the following: ; extern "C" { unsigned int FromBase64(const char* const pbSource, const unsigned int uSourceLen, unsigned char *pOutBuf); } ; 4. Call the function public _FromBase64 _FromBase64: ; ----------- cdecl calling convention support push ebx push esi push edi push ebp mov esi,[esp+20] ; pbSource mov ecx,[esp+24] ; uSourceLen mov edi,[esp+28] ; pOutBuf cmp ecx,0 ; nothing to do! jz cleanup ;xor ebx,ebx mov ebx,base64lut ; for xlatb @16: cmp ecx,16 jl @15orless sub ecx,16 movdqa xmm0,[esi] ; get 16 _aligned_ input bytes add esi,16 ; -------- bytes 0..3 movd eax,xmm0 psrldq xmm0,4 xor ebp,ebp ; clear ebp ; -------- decode one byte from input --------- ; mov bl,al ; mov dl[base64lut+ebx] xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,26 ; shift them to "top" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,20 ; shift them to "top - 6 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,14 ; shift them to "top - 12 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shl edx,8 ; shift them to "top - 18 bits" of register or ebp,edx ; store them in output register ; three decoded output bytes have been generated, which have to be stored bswap ebp movd mm0,ebp ; mm0 contains 3 valid bytes now ; -------- bytes 4..7 movd eax,xmm0 psrldq xmm0,4 xor ebp,ebp ; clear ebp ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,26 ; shift them to "top" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,20 ; shift them to "top - 6 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,14 ; shift them to "top - 12 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shl edx,8 ; shift them to "top - 18 bits" of register or ebp,edx ; store them in output register ; three decoded output bytes have been generated, which have to be stored bswap ebp movd mm1,ebp psllq mm1,24 por mm0,mm1 ; mm0 contains 6 valid bytes now ; -------- bytes 8..11 movd eax,xmm0 psrldq xmm0,4 xor ebp,ebp ; clear ebp ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,26 ; shift them to "top" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,20 ; shift them to "top - 6 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,14 ; shift them to "top - 12 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shl edx,8 ; shift them to "top - 18 bits" of register or ebp,edx ; store them in output register ; three decoded output bytes have been generated, which have to be stored bswap ebp movd mm1,ebp psllq mm1,48 por mm0,mm1 ; mm0 contains 8 valid bytes now and ebp,0x00ff0000 ; select remaining (unstored) output byte shl ebp,8 ; move to high byte (will be low byte later) movntq [edi],mm0 ; store 8 output bytes add edi,8 ; -------- bytes 12..15 movd eax,xmm0 ; -------- decode one byte from input --------- xor edx,edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,18 ; shift them to "top - 8 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,12 ; shift them to "top - 14 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,6 ; shift them to "top - 20 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al or ebp,edx ; store them in output register ; _four_ decoded and unstored output bytes are in ebp now bswap ebp movnti [edi],ebp add edi,4 jmp @16 @15orless: cmp ecx,4 jle final4 sub ecx,4 ; four input bytes will be processed mov eax,[esi] ; get four input bytes add esi,4 xor ebp,ebp ; clear ebp ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,26 ; shift them to "top" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,20 ; shift them to "top - 6 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,14 ; shift them to "top - 12 bits" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shl edx,8 ; shift them to "top - 18 bits" of register or ebp,edx ; store them in output register ; three decoded output bytes have been generated, which have to be stored mov edx,ebp bswap edx mov [edi],dx shr edx,16 add edi,2 mov [edi],dl inc edi jmp @15orless final4: cmp ecx,0 xor eax,eax ; no padding! jz cleanup mov eax,[esi] ; get four input bytes xor ebp,ebp ; clear ebp ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,26 ; shift them to "top" of register or ebp,edx ; store them in output register ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,20 ; shift them to "top - 6 bits" of register or ebp,edx ; store them in output register ; at this point, padding might occur cmp al,0x3d jz padding2 ; -------- decode one byte from input --------- xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shl edx,14 ; shift them to "top - 12 bits" of register or ebp,edx ; store them in output register ; at this point, padding might occur cmp al,0x3d jz padding1 ; -------- decode one byte from input --------- xor edx,edx ; clear edx xlatb ; decode one byte (yields 6 decoded bits) mov dl,al shr eax,8 ; shift to next encoded byte shl edx,8 ; shift them to "top - 18 bits" of register or ebp,edx ; store them in output register ; three decoded output bytes have been generated, which have to be stored mov edx,ebp bswap edx mov [edi],dx shr edx,16 add edi,2 mov [edi],dl xor eax,eax ; 0 means "no padding" jmp cleanup padding2: ; ebp contains only one valid output byte mov edx,ebp shr edx,24 mov [edi],dl mov eax,2 ; 2 means "2 bytes padding" jmp cleanup padding1: ; ebp contains only two valid output bytes mov edx,ebp bswap edx mov [edi],dx mov eax,1 ; 1 means "1 byte padding" cleanup: emms pop ebp pop edi pop esi pop ebx ret section '.data' data readable writable base64lut: file 'dec.bin' |
|||
17 Feb 2008, 20:15 |
|
f0dder 17 Feb 2008, 20:20
I'm wondering: is this just for fun, or do you have some real-world examples where BASE64 needs a lot of speed?
|
|||
17 Feb 2008, 20:20 |
|
r22 17 Feb 2008, 22:18
MIME uses base64 to send attachments, so a fast base64 DEcoder could be optimal for a Virus scanner attached to a mail server.
|
|||
17 Feb 2008, 22:18 |
|
revolution 17 Feb 2008, 22:27
One thing to note with the code is that it needs SSE2. You might want to consider detecting if the CPU is actually capable of SSE2 and drop to alternative code for older CPUs.
|
|||
17 Feb 2008, 22:27 |
|
Nicodemus 18 Feb 2008, 06:49
Actually, I prefer choosing between different implementations (e.g. SSE vs. MMX vs. classic code) by doing a CPU detection once at application start and selecting from a set of CPU-feature-specific libraries just there.
This avoids the overhead of doing this in every function, which is a performance hit. Nevertheless, you are right: Both my algorithms above require SSE2; the 2nd version of my decoder also requires a 16-byte-aligned input byte array). |
|||
18 Feb 2008, 06:49 |
|
Goto page 1, 2 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.