flat assembler
Message board for the users of flat assembler.

Index > Main > Fastest base64 encoding (null terminated strings)

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
r22



Joined: 27 Dec 2004
Posts: 805
r22
I was working on a fast base64 null terminated string encoding algorithm. Say for appending to HTTP POSTs.
Here's my simple benchmark and function.
If you can improve the speed I would be grateful.
Code:
format PE GUI
entry Main

section '.data' data readable writeable
base64 db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0
teststr db 'username:password1',0
fmt    db '%lu',0
txtbuf rb 40
retbuf rb 0FFh

section '.code' code readable executable

  Main:
    push ebp
    mov ebp,esp
    call [GetTickCount]
    push eax
    mov ebx,0FFFFFFFh ;# of times to call function
   Bench:
    dec ebx
    test ebx,ebx
    jz Ending
    push retbuf
    push teststr
    call Encode64
    jmp Bench
   Ending:
    call [GetTickCount]
    pop ecx
    sub eax,ecx ;time it takes for all calls
    push eax
    push fmt
    push txtbuf
    call [wsprintf]
    push 0
    push txtbuf
    push txtbuf
    push 0
    call [MessageBox]
    push dword 0
    push dword retbuf
    push dword retbuf
    push dword 0
    call [MessageBox]
    mov esp,ebp
    pop ebp
    retn 0

  Encode64:
    .retStr equ ebp+12
    .theStr equ ebp+8
    push ebp
    mov ebp,esp
    push esi
    push edi
    mov esi, [.theStr]
    mov edi, [.retStr]
    .transfer:
    mov ecx, [esi]
    bswap ecx
    shr ecx,8 ;get bits aligned for b64 expand
    cmp byte[esi],0 ;null end func
    je .finished
    xor edx,edx
    cmp byte[esi+1],0 ;1byte to encode then end
    je .onebyte
    cmp byte[esi+2],0 ;2byte to encode then end
    je .twobyte
    mov dl,cl         ;main expansion
    and dl,00111111b
    mov dl,byte[base64+edx]
    mov byte[edi+3],dl
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    shr ecx,6
    mov dl,byte[base64+edx]
    mov byte[edi+2],dl
    mov dl,cl
    and dl,00111111b
    shr ecx,6
    mov dl,byte[base64+edx]
    mov byte[edi+1],dl
    mov dl,cl
    mov dl,byte[base64+edx]
    mov byte[edi],dl
    jmp .skipper
  .onebyte:
    mov word[edi+2],'=='
    shr ecx,12 ;byte in cl
    mov dl,cl
    and dl,00111111b
    shr ecx,6
    mov dl,byte[base64+edx]
    mov byte[edi+1],dl
    mov dl,cl
    mov dl,byte[base64+edx]
    mov byte[edi],dl
    jmp .finished
  .twobyte:
    mov byte[edi+3],'='
    shr ecx,6 ;byte in cl
    mov dl,cl
    and dl,00111111b
    shr ecx,6
    mov dl,byte[base64+edx]
    mov byte[edi+2],dl
    mov dl,cl
    and dl,00111111b
    shr ecx,6
    mov dl,byte[base64+edx]
    mov byte[edi+1],dl
    mov dl,cl
    mov dl,byte[base64+edx]
    mov byte[edi],dl
    jmp .finished
  .skipper:
    add esi,3
    add edi,4
    jmp .transfer
  .finished:
    pop edi
    pop esi
    mov esp,ebp
    pop ebp
    retn 8

section '.idata' import data readable writeable
  dd 0,0,0,RVA lpszKernel32, RVA lpszKernel32Table
  dd 0,0,0,RVA lpszUser32, RVA lpszUser32Table
  dd 0,0,0,0,0

  lpszKernel32Table:
    GetTickCount     dd RVA lpszGetTickCount
    dd 0

  lpszUser32Table:
    MessageBox       dd RVA lpszMessageBoxA
    wsprintf         dd RVA lpszwsprintfA
    dd 0

  lpszKernel32       db 'KERNEL32.DLL',0
  lpszUser32         db 'USER32.DLL',0
  lpszGetTickCount   db 0,0,'GetTickCount',0
  lpszMessageBoxA    db 0,0,'MessageBoxA',0
  lpszwsprintfA      db 0,0,'wsprintfA',0    


Sorry for the raw import table (I have issues :P).
I am pretty sure this is the fastest implementation possible, but I would love to be proved wrong.
Post 12 Feb 2005, 23:08
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
decard



Joined: 11 Sep 2003
Posts: 1092
Location: Poland
decard
Hi r22,
sometime ago I was working on something similar. Here's my function if you'd like to take a look. It is not optimised, but you may want to see different implementation of similar algoorithm. I see some similarities with my code, but I don't have time to compare them (at least not now, I'm too sleepy Wink). base64_table is just the same as yours "base64".

Code:
; encodes given buffer into base64 stream
proc base64_encode, .source,.src_size,.out_buf
        begin
        push    ecx edx esi edi
        mov     esi,[.source]
        mov     edi,[.out_buf]
        mov     ecx,[.src_size]
     .loop:
        mov     edx,[esi]
        cmp     ecx,3
        jae     .remainder_ok
        and     edx,0xffff
        cmp     ecx,2
        jae     .remainder_ok
        and     edx,0xff
     .remainder_ok:
        bswap   edx

        mov     eax,edx
        shr     eax,26
        and     eax,111111b
        mov     al,[base64_table+eax]
        mov     [edi],al
        inc     edi

        mov     eax,edx
        shr     eax,20
        and     eax,111111b
        mov     al,[base64_table+eax]
        mov     [edi],al
        inc     edi
        dec     ecx
        jz      .r2

        mov     eax,edx
        shr     eax,14
        and     eax,111111b
        mov     al,[base64_table+eax]
        mov     [edi],al
        inc     edi
        dec     ecx
        jz      .r1

        mov     eax,edx
        shr     eax,8
        and     eax,111111b
        mov     al,[base64_table+eax]
        mov     [edi],al
        inc     edi

        add     esi,3
        dec     ecx
        jnz     .loop
        jmp     .finish
     .r2:
        mov     byte[edi],'='
        inc     edi
     .r1:
        mov     byte[edi],'='
        inc     edi
     .finish:
        mov     byte[edi],0
        sub     edi,[.out_buf]
        mov     eax,edi         ; return output size in eax
        pop     edi esi edx ecx
        return
endp    


you can download whole encoding program from here: http://decard.net/download/base64.zip
Post 12 Feb 2005, 23:52
View user's profile Send private message Visit poster's website Reply with quote
Reverend



Joined: 24 Aug 2004
Posts: 408
Location: Poland
Reverend
Heh. I also wrote base64 routine one day. I was curious wihich one is the fastest one, so I copied the three routines (r22's, decard's and mine) in one file and loop every call 256 times encoding the same string. I must admit that my implementation was the worst one Razz Decard has a second place, and you r22 made really an impresing job, because yours was much faster. Congratulations Laughing

Btw. My implementation is on my site to download

EDIT: I forgot to show the results, shit, here it is:
r22: 5415h
decard: 207BFh
reverend: F43BBh (Laughing)

This number is the difference betwwen cycles before and after loop. I chked it using rdtsc instruction
Post 13 Feb 2005, 14:28
View user's profile Send private message Visit poster's website Reply with quote
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Madis731
Though, you have to love decard's one - its the cleanest Smile
and you can also do more with a proc than just a call, right?
Post 14 Feb 2005, 10:59
View user's profile Send private message Visit poster's website Yahoo Messenger MSN Messenger Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1171
Location: Overflow
Matrix
hello
quote = r22:"_I am pretty sure this is the fastest implementation possible, but I would love to be proved wrong."

i whouldn't say that, i was just having problems running your testfile, it did nothing. :/

say, if you use 0 terminated strings, you cannot encode 0 bytes, whouldn't it be nicer to use a sized buffer?
Post 14 Feb 2005, 17:12
View user's profile Send private message Visit poster's website Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1171
Location: Overflow
Matrix
whould you test this for me?
Code:
format PE GUI
entry Main

section '.data' data readable writeable
base64 db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0
teststr db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0
fmt    db '%lu',0
txtbuf rb 40
retbuf rb 0FFh

section '.code' code readable executable

  Main:
    push ebp
    mov ebp,esp
    call [GetTickCount]
    push eax
    mov ebx,0FFFFFh;FFh ;# of times to call function
   Bench:
    dec ebx
    test ebx,ebx
    jz Ending
    push retbuf
    push teststr
    call Encode64
    jmp Bench
   Ending:
    call [GetTickCount]
    pop ecx
    sub eax,ecx ;time it takes for all calls
    push eax
    push fmt
    push txtbuf
    call [wsprintf]
    push 0
    push txtbuf
    push txtbuf
    push 0
    call [MessageBox]
    push dword 0
    push dword retbuf
    push dword retbuf
    push dword 0
    call [MessageBox]
    mov esp,ebp
    pop ebp
    retn 0

  Encode64:
    .retStr equ ebp+12
    .theStr equ ebp+8
    push ebp
    mov ebp,esp
    push esi
    push edi
    push edx
    push ebx
    mov esi, [.theStr]
    mov edi, [.retStr]
    .transfer:
    mov ecx,[esi]
    mov eax,ecx
    bswap ecx
    shr ecx,8 ;get bits aligned for b64 expand
    add esi,3
    lea edx,[eax-$00010101]
    xor eax,-1
    and edx,eax
    and edx,$808080
    jnz .endloop
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+3],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+2],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+1],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi],al
    add edi,4
    jmp .transfer
.endloop:
    shr edx,8
    jc .finished
    shr edx,8
    jc .onebyte
    xor edx,edx
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov ah,'=' ;.twobyte:
    mov [edi+2],ax
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+1],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+0],al
    jmp .finished
  .onebyte:
    xor edx,edx
    shr ecx,12
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+1],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi],al
    mov word [edi+2],'=='
  .finished:
    pop ebx
    pop edx
    pop edi
    pop esi
    mov esp,ebp
    pop ebp
    retn 8

section '.idata' import data readable writeable
  dd 0,0,0,RVA lpszKernel32, RVA lpszKernel32Table
  dd 0,0,0,RVA lpszUser32, RVA lpszUser32Table
  dd 0,0,0,0,0

  lpszKernel32Table:
    GetTickCount     dd RVA lpszGetTickCount
    dd 0

  lpszUser32Table:
    MessageBox       dd RVA lpszMessageBoxA
    wsprintf         dd RVA lpszwsprintfA
    dd 0

  lpszKernel32       db 'KERNEL32.DLL',0
  lpszUser32         db 'USER32.DLL',0
  lpszGetTickCount   db 0,0,'GetTickCount',0
  lpszMessageBoxA    db 0,0,'MessageBoxA',0
  lpszwsprintfA      db 0,0,'wsprintfA',0
    

fixed


Last edited by Matrix on 15 Feb 2005, 17:45; edited 1 time in total
Post 14 Feb 2005, 18:21
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
Matrix, if you copy and paste my code into fasmw it should compile correctly and the reason you saw nothing was because it was running a benchmark testing 0x0FFFFFFF function calls to the same function it then (after 13.3seconds on my computer) outputs the results (how many ms it took) to a message box.

I've tested your function, I had to change the benchmark counter from ebx to edi because your function affects ebx (for xlatb), however the results were not base 64 encoded infact characters that weren't even in the base64 db appeared in the results. Also running the function 0x0FFFFFFF times its speed was 100ms slower than mine.

So fixing the bug and with a few optimizations it maybe faster.
Post 15 Feb 2005, 02:15
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1171
Location: Overflow
Matrix
the base64 address does not go into EBX :/
howcome yours is faster?
xlat is that slow? Confused
Post 15 Feb 2005, 14:24
View user's profile Send private message Visit poster's website Reply with quote
Octavio



Joined: 21 Jun 2003
Posts: 366
Location: Spain
Octavio
Matrix wrote:
the base64 address does not go into EBX :/
howcome yours is faster?
xlat is that slow? Confused

those modern processors, who knows? Very Happy
i think that 'rol' is slower than 'shl' and both are slow instructions.
also the mix of 8bits and 32 bits instructions
can cause agi stalls Shocked

You are encoding in reverse order you need to add a 'bswap'

I think that the code of 'rr' can be optimized but it don´t needs to be optimized.
Post 15 Feb 2005, 16:07
View user's profile Send private message Visit poster's website Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1171
Location: Overflow
Matrix
i could give an improvement of 20% with just a few modifications,
at least another 25 % could be possible in case using all four bytes read, and adding 4 to esi

Code:
format PE GUI
entry Main

section '.data' data readable writeable
base64 db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0
;teststr db '1',0
teststr db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0
fmt    db '%lu',0
txtbuf rb 40
retbuf rb 0FFh

section '.code' code readable executable

  Main:
    push ebp
    mov ebp,esp
    call [GetTickCount]
    push eax
    mov ebx,0FFFFFh;FFh ;# of times to call function
   Bench:
    dec ebx
    test ebx,ebx
    jz Ending
    push retbuf
    push teststr
    call Encode64
    jmp Bench
   Ending:
    call [GetTickCount]
    pop ecx
    sub eax,ecx ;time it takes for all calls
    push eax
    push fmt
    push txtbuf
    call [wsprintf]
    push 0
    push txtbuf
    push txtbuf
    push 0
    call [MessageBox]
    push dword 0
    push dword retbuf
    push dword retbuf
    push dword 0
    call [MessageBox]
    mov esp,ebp
    pop ebp
    retn 0

  Encode64:
    .retStr equ ebp+12
    .theStr equ ebp+8
    push ebp
    mov ebp,esp
    push esi
    push edi
    push edx
    push ebx
    mov esi, [.theStr]
    mov edi, [.retStr]
    xor edx,edx
    .transfer:
    mov ecx,[esi]

;    or cl,cl
;     jz .finished
;    or ch,ch
;     jz .onebyte
;    bswap ecx
;    shr ecx,8 ;get bits aligned for b64 expand
;    or cl,cl
;     jz .twobyte

    bswap ecx
    shr ecx,8 ;get bits aligned for b64 expand
    cmp byte[esi],0 ;null end func
    je .finished
    cmp byte[esi+1],0 ;1byte to encode then end
    je .onebyte
    cmp byte[esi+2],0 ;2byte to encode then end
    je .twobyte

    add esi,3
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+3],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+2],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+1],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi],al
    add edi,4
    jmp .transfer
.twobyte:
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov ah,'='
    mov [edi+2],ax
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+1],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+0],al
    jmp .finished
  .onebyte:
;    bswap ecx
;    xor edx,edx
;    shr ecx,20
    shr ecx,12
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi+1],al
    shr ecx,6
    mov dl,cl
    and dl,00111111b
    mov al,[base64+edx]
    mov [edi],al
    mov word [edi+2],'=='
  .finished:
    pop ebx
    pop edx
    pop edi
    pop esi
    mov esp,ebp
    pop ebp
    retn 8

section '.idata' import data readable writeable
  dd 0,0,0,RVA lpszKernel32, RVA lpszKernel32Table
  dd 0,0,0,RVA lpszUser32, RVA lpszUser32Table
  dd 0,0,0,0,0

  lpszKernel32Table:
    GetTickCount     dd RVA lpszGetTickCount
    dd 0

  lpszUser32Table:
    MessageBox       dd RVA lpszMessageBoxA
    wsprintf         dd RVA lpszwsprintfA
    dd 0

  lpszKernel32       db 'KERNEL32.DLL',0
  lpszUser32         db 'USER32.DLL',0
  lpszGetTickCount   db 0,0,'GetTickCount',0
  lpszMessageBoxA    db 0,0,'MessageBoxA',0
  lpszwsprintfA      db 0,0,'wsprintfA',0
    



but this is insane, look at this:
when using this code
Code:
    bswap ecx
    shr ecx,8 ;get bits aligned for b64 expand
    cmp byte[esi],0 ;null end func
    je .finished
    xor edx,edx
    cmp byte[esi+1],0 ;1byte to encode then end
    je .onebyte
    cmp byte[esi+2],0 ;2byte to encode then end
    je .twobyte
    


then is 3 times faster then
Code:
    or cl,cl
     jz .finished
    or ch,ch
     jz .onebyte
    bswap ecx
    shr ecx,8 ;get bits aligned for b64 expand
    or cl,cl
     jz .twobyte
    

or
the one that is branch optimised
Post 15 Feb 2005, 17:42
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
You'd think the or cl,cl would be faster but on a Pentium 4 the old OR optimizations don't work. Also the alignment of the code and how it hit the processor play a factor. I actually switch around instructions in my code to get it to hit the processor in the fastest way possible.

What I'm testing on
3.2ghz Pentium 4
1024mb ddr ram

I'm trying to unroll it right now, but so far it seems to slow the algorithm down.
Post 15 Feb 2005, 23:22
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
Matrix



Joined: 04 Sep 2004
Posts: 1171
Location: Overflow
Matrix
but you see what was i thinking of,
you were comparing memory with 0,
while i was using registers, howcome yours is much faster Confused

possible exposition:
computer detects sequential reading and reads data in cache,
so read penalties won't occour ?
this gives an increment of 300% in performance?

in this case you may want to prefer using Prefetchnta too or reading cache lines 32 or 64 bytes in depth
Post 15 Feb 2005, 23:42
View user's profile Send private message Visit poster's website Reply with quote
Nicodemus



Joined: 11 Feb 2008
Posts: 6
Nicodemus
Hi there,

you might be interested in the following base64-encoding algorithm, which can process up to ~700 MB/s of input on my P4 D @3400 MHz.


  • Uses Dan Rollins' idea of a 12-bit lookup-table (8192 bytes).
  • Supports padding ('=' and '==') and arbitrary input lengths.


Notes:

  • pOutBuf will not be null terminated
  • pOutBuf must be
    Code:
    (uSourceLen-uSourceLen%3)*4/3+((uSourceLen%3)?4:0)    
    bytes long.

  • add an additional byte for null termination
  • pbSource should be aligned to 16 bytes for maximum speed
  • Change movntdq to movdqa or movdqu (and movnti to mov), if output is short and will be used immediately after encoding


Code:
; Nicodemus' base64 encoding algorithm
format MS COFF

section '.text' code readable executable

; base64 encoding (up to >700 MB/s on input side on P4 D 3400 MHz)
; -------- optimized to encode large amounts of data --------
; 
; 1. Compile with fasm (produces obj file)
; 2. Link this obj file to your app
; 3. declare the following:
;  extern "C" { void ToBase64(const unsigned char* const pbSource, const unsigned int uSourceLen, char *pOutBuf); }
; 4. Call the function
public _ToBase64

 _ToBase64:
   ; ----------- cdecl calling convention support
      push   ebx
  push   esi
  push   edi

      mov     esi,[esp+16]    ; pbSource
  mov     ecx,[esp+20]    ; uSourceLen
        mov     edi,[esp+24]    ; pOutBuf

       cmp     ecx,13          ; 13 or more bytes?
 jl      remaining

@13:
   sub     ecx,12          ; 12 bytes will be processed

    mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm0,ebx

 mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm1,ebx

 psllq   mm1,32
      por     mm0,mm1
     movq2dq xmm0,mm0

        mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm0,ebx

 mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm1,ebx

 psllq   mm1,32
      por     mm0,mm1
     movq2dq xmm1,mm0

        pslldq  xmm1,8
      por     xmm0,xmm1

       movntdq [edi],xmm0      ; store 16 bytes aligned
    add     edi,16          ; 16 output bytes have been generated

   cmp     ecx,13          ; 13 or more bytes left?
    jge     @13

remaining:       ; process remaining bytes (12 bytes or less)
        cmp     ecx,12          ; 12 bytes?
 je      @12
 cmp     ecx,7           ; 7 bytes or more?
  jge     @7
  cmp     ecx,6           ; 6 bytes or more?
  je      @6
remaining4:
       cmp     ecx,4           ; 4 bytes or more?
  jge     @4
  cmp     ecx,3           ; 3 bytes?
  je      @3
remaining2:
       cmp     ecx,2           ; 2 bytes?
  je      @2
  cmp     ecx,1           ; 1 byte?
   je      @1
  jmp     cleanup

@12:
     mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm0,ebx

 mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm1,ebx

 psllq   mm1,32
      por     mm0,mm1
     movq2dq xmm0,mm0

        mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm0,ebx

 ;inc     esi
        ;mov     dx,[esi] ; get 2 bytes
     ;shl     edx,8
      ;dec     esi
        ;mov     dl,[esi] ; get 1 byte
      ;add     esi,3
      mov     bx,[esi] ; get 2 bytes
      add     esi,2
       mov     dl,[esi] ; get 1 byte
       shl     edx,16
      mov     dx,bx
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm1,ebx

 psllq   mm1,32
      por     mm0,mm1
     movq2dq xmm1,mm0

        pslldq  xmm1,8
      por     xmm0,xmm1

       movntdq [edi],xmm0      ; store 16 bytes aligned
    jmp     cleanup

@7:          ; process 6 input bytes
     sub     ecx,6

   mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm0,ebx

 mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm1,ebx

 psllq   mm1,32
      por     mm0,mm1
     movntq  [edi],mm0

       add edi,8
   jmp remaining4

@6:          ; process 6 input bytes
      mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm0,ebx

 mov     bx,[esi] ; get 2 bytes
      add     esi,2
       mov     dl,[esi] ; get 1 byte
       shl     edx,16
      mov     dx,bx
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12          ; upper 12 bits
     mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff  ; lower 12 bits
     mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax
       movd    mm1,ebx

 psllq   mm1,32
      por     mm0,mm1
     movntq  [edi],mm0
   jmp cleanup

@4:              ; process three input bytes
 sub     ecx,3

   mov     edx,[esi]
   add     esi,3
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12
      mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff
      mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax

   movnti  [edi],ebx

       add     edi,4
       jmp     remaining2

@3:               ; process three input bytes
 mov     bx,[esi] ; get 2 bytes
      add     esi,2
       mov     dl,[esi] ; get 1 byte
       inc     esi
 shl     edx,16
      mov     dx,bx
       bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12
      mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff
      mov     bx,[base64lut+edx*2]
        shl     ebx,16
      mov     bx,ax

   movnti  [edi],ebx
   jmp     cleanup

@2:          ; process two input bytes (with padding)
    xor     edx,edx
     mov     dx,[esi]  ; get two bytes
   bswap   edx
 shr     edx,8
       mov     ebx,edx
     shr     ebx,12
      mov     ax,[base64lut+ebx*2]
        and     edx,0x00000fff
      mov     bx,[base64lut+edx*2]
        mov     bh,0x3d         ; '=' padding!
    shl     ebx,16
      mov     bx,ax

   movnti  [edi],ebx
   jmp     cleanup

@1:          ; process one input byte (with padding)
     xor     edx,edx
     mov     dl,[esi]  ; get one byte
    shl     dx,4
        mov     eax,0x3d3d0000  ; '=' padding!
    mov     ax,[base64lut+edx*2]

    movnti  [edi],eax        ; store

cleanup:
        emms            ; switch back to standard FPU mode
  pop     edi
 pop     esi
 pop     ebx
 ret

section '.data' data readable writeable
    base64lut: file 'lut.bin'
    


Description: 12-bit lookup table for base64 encoder.
Rename to lut.bin before use.

Download
Filename: lut.bin.txt
Filesize: 8 KB
Downloaded: 230 Time(s)

Post 14 Feb 2008, 18:37
View user's profile Send private message Reply with quote
Nicodemus



Joined: 11 Feb 2008
Posts: 6
Nicodemus
Here comes the decoder (at ~300 MB/s on my P4 D 3400 MHz).

This is a "pure" decoder variant, which accepts padding ('=' and '=='), but no "non-alphabetic" characters in the input sequence.

Notes:

  • input buffer and input size must be >0 and multiples of 4.
  • output buffer must be
    Code:
    (outsize-outsize%4)*3/4+((outsize%4)?3:0)    
    bytes large.
  • returns the number of padding bytes (0, 1 or 2) in eax.


Hint for further optimization: This algorithm might have a potential to be accelerated by using SSE-registers for reducing memory bus traffic.

Code:
; Nicodemus' base64 decoding algorithm
format MS COFF

section '.text' code readable executable

; base64 decoding (up to 300 MB/s on input side on P4 D 3400 MHz)
;
; 1. Compile with fasm (produces obj file)
; 2. Link this obj file to your app
; 3. declare the following:
;       extern "C" { unsigned int FromBase64(const char* const pbSource, const unsigned int uSourceLen, unsigned char *pOutBuf); }
; 4. Call the function
public _FromBase64

 _FromBase64:
        ; ----------- cdecl calling convention support
        push   ebx
        push   esi
        push   edi
        push   ebp

        mov     esi,[esp+20]    ; pbSource
        mov     ecx,[esp+24]    ; uSourceLen
        mov     edi,[esp+28]    ; pOutBuf

        mov     ebx,base64lut   ; for xlatb

decode4bytes:
        cmp     ecx,4
        jle     final4

        sub     ecx,4                   ; four input bytes will be processed

        mov     eax,[esi]               ; get four input bytes
        add     esi,4

        xor     ebp,ebp                 ; clear ebp

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,26                  ; shift them to "top" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,20                  ; shift them to "top - 6 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,14                  ; shift them to "top - 12 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,8                   ; shift them to "top - 18 bits" of register
        or      ebp,edx                 ; store them in output register

        ; three decoded output bytes have been generated, which have to be stored

        mov     edx,ebp
        bswap   edx
        mov     [edi],dx
        shr     edx,16
        add     edi,2
        mov     [edi],dl
        inc     edi

        jmp     decode4bytes

final4:
        mov     eax,[esi]               ; get four input bytes

        xor     ebp,ebp                 ; clear ebp

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,26                  ; shift them to "top" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,20                  ; shift them to "top - 6 bits" of register
        or      ebp,edx                 ; store them in output register

        ; at this point, padding might occur
        cmp     al,0x3d
        jz      padding2

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,14                  ; shift them to "top - 12 bits" of register
        or      ebp,edx                 ; store them in output register

        ; at this point, padding might occur
        cmp     al,0x3d
        jz      padding1

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,8                   ; shift them to "top - 18 bits" of register
        or      ebp,edx                 ; store them in output register

        ; three decoded output bytes have been generated, which have to be stored

        mov     edx,ebp
        bswap   edx
        mov     [edi],dx
        shr     edx,16
        add     edi,2
        mov     [edi],dl

        xor     eax,eax                 ; 0 means "no padding"
        jmp     cleanup

padding2:
        ; ebp contains only one valid output byte
        mov     edx,ebp
        shr     edx,24
        mov     [edi],dl

        mov     eax,2                   ; 2 means "2 bytes padding"
        jmp     cleanup

padding1:
        ; ebp contains only two valid output bytes
        mov     edx,ebp
        bswap   edx
        mov     [edi],dx

        mov     eax,1                   ; 1 means "1 byte padding"

cleanup:
        pop     ebp
        pop     edi
        pop     esi
        pop     ebx
        ret

section '.data' data readable writeable
    base64lut:  file 'dec.bin'
    


Description: 256-byte lookup table. Rename to dec.bin before use.
Download
Filename: dec.bin.txt
Filesize: 256 Bytes
Downloaded: 242 Time(s)

Post 16 Feb 2008, 12:01
View user's profile Send private message Reply with quote
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
Hi r22,
you could maybe try deleting all instances of
Code:
and dl,00111111b    
and just make your LUT 256bits such as
Code:
base64 db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
       db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
       db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
       db 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',0    
Post 17 Feb 2008, 14:23
View user's profile Send private message Reply with quote
Nicodemus



Joined: 11 Feb 2008
Posts: 6
Nicodemus
Hi there!

I optimized the decoder (see post above). This version requires a 16-byte-aligned base64 source.
With this precondition, this version of the decoder provides a ~13.7 % speed increase against the previous one.

Note:
You will require the lookup-table-file "dec.bin" from my previous post to assemble this one.

Code:
; Nicodemus' base64 decoding algorithm
format MS COFF

section '.text' code readable executable

; base64 decoding (up to 310 MB/s on input side on P4 D 3400 MHz)
;
; 1. Compile with fasm (produces obj file)
; 2. Link this obj file to your app
; 3. declare the following:
;       extern "C" { unsigned int FromBase64(const char* const pbSource, const unsigned int uSourceLen, unsigned char *pOutBuf); }
; 4. Call the function
public _FromBase64

 _FromBase64:
        ; ----------- cdecl calling convention support
        push   ebx
        push   esi
        push   edi
        push   ebp

        mov     esi,[esp+20]    ; pbSource
        mov     ecx,[esp+24]    ; uSourceLen
        mov     edi,[esp+28]    ; pOutBuf

        cmp     ecx,0           ; nothing to do!
        jz      cleanup

        ;xor     ebx,ebx
        mov     ebx,base64lut   ; for xlatb

@16:
        cmp     ecx,16
        jl      @15orless

        sub     ecx,16
        movdqa  xmm0,[esi]              ; get 16 _aligned_ input bytes
        add     esi,16

        ; -------- bytes 0..3
        movd    eax,xmm0
        psrldq  xmm0,4

        xor     ebp,ebp                 ; clear ebp
        ; -------- decode one byte from input ---------
        ; mov    bl,al
        ; mov    dl[base64lut+ebx]
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,26                  ; shift them to "top" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,20                  ; shift them to "top - 6 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,14                  ; shift them to "top - 12 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shl     edx,8                   ; shift them to "top - 18 bits" of register
        or      ebp,edx                 ; store them in output register

        ; three decoded output bytes have been generated, which have to be stored
        bswap   ebp
        movd    mm0,ebp                 ; mm0 contains 3 valid bytes now


        ; -------- bytes 4..7
        movd    eax,xmm0
        psrldq  xmm0,4

        xor     ebp,ebp                 ; clear ebp
        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,26                  ; shift them to "top" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,20                  ; shift them to "top - 6 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,14                  ; shift them to "top - 12 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shl     edx,8                   ; shift them to "top - 18 bits" of register
        or      ebp,edx                 ; store them in output register

        ; three decoded output bytes have been generated, which have to be stored
        bswap   ebp
        movd    mm1,ebp
        psllq   mm1,24
        por     mm0,mm1                 ; mm0 contains 6 valid bytes now

        ; -------- bytes 8..11
        movd    eax,xmm0
        psrldq  xmm0,4

        xor     ebp,ebp                 ; clear ebp
        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,26                  ; shift them to "top" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,20                  ; shift them to "top - 6 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,14                  ; shift them to "top - 12 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shl     edx,8                   ; shift them to "top - 18 bits" of register
        or      ebp,edx                 ; store them in output register

        ; three decoded output bytes have been generated, which have to be stored
        bswap   ebp
        movd    mm1,ebp
        psllq   mm1,48
        por     mm0,mm1                 ; mm0 contains 8 valid bytes now


        and     ebp,0x00ff0000          ; select remaining (unstored) output byte
        shl     ebp,8                   ; move to high byte (will be low byte later)

        movntq  [edi],mm0               ; store 8 output bytes
        add     edi,8

        ; -------- bytes 12..15
        movd    eax,xmm0

        ; -------- decode one byte from input ---------
        xor     edx,edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,18                  ; shift them to "top - 8 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,12                  ; shift them to "top - 14 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,6                   ; shift them to "top - 20 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        or      ebp,edx                 ; store them in output register

        ; _four_ decoded and unstored output bytes are in ebp now
        bswap   ebp
        movnti  [edi],ebp
        add     edi,4

        jmp     @16

@15orless:
        cmp     ecx,4
        jle     final4

        sub     ecx,4                   ; four input bytes will be processed

        mov     eax,[esi]               ; get four input bytes
        add     esi,4

        xor     ebp,ebp                 ; clear ebp

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,26                  ; shift them to "top" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,20                  ; shift them to "top - 6 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,14                  ; shift them to "top - 12 bits" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shl     edx,8                   ; shift them to "top - 18 bits" of register
        or      ebp,edx                 ; store them in output register

        ; three decoded output bytes have been generated, which have to be stored

        mov     edx,ebp
        bswap   edx
        mov     [edi],dx
        shr     edx,16
        add     edi,2
        mov     [edi],dl
        inc     edi

        jmp     @15orless

final4:
        cmp     ecx,0
        xor     eax,eax                 ; no padding!
        jz      cleanup

        mov     eax,[esi]               ; get four input bytes

        xor     ebp,ebp                 ; clear ebp

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,26                  ; shift them to "top" of register
        or      ebp,edx                 ; store them in output register

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,20                  ; shift them to "top - 6 bits" of register
        or      ebp,edx                 ; store them in output register

        ; at this point, padding might occur
        cmp     al,0x3d
        jz      padding2

        ; -------- decode one byte from input ---------
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shl     edx,14                  ; shift them to "top - 12 bits" of register
        or      ebp,edx                 ; store them in output register

        ; at this point, padding might occur
        cmp     al,0x3d
        jz      padding1

        ; -------- decode one byte from input ---------
        xor     edx,edx                 ; clear edx
        xlatb                           ; decode one byte (yields 6 decoded bits)
        mov     dl,al
        shr     eax,8                   ; shift to next encoded byte
        shl     edx,8                   ; shift them to "top - 18 bits" of register
        or      ebp,edx                 ; store them in output register

        ; three decoded output bytes have been generated, which have to be stored

        mov     edx,ebp
        bswap   edx
        mov     [edi],dx
        shr     edx,16
        add     edi,2
        mov     [edi],dl

        xor     eax,eax                 ; 0 means "no padding"
        jmp     cleanup

padding2:
        ; ebp contains only one valid output byte
        mov     edx,ebp
        shr     edx,24
        mov     [edi],dl

        mov     eax,2                   ; 2 means "2 bytes padding"
        jmp     cleanup

padding1:
        ; ebp contains only two valid output bytes
        mov     edx,ebp
        bswap   edx
        mov     [edi],dx

        mov     eax,1                   ; 1 means "1 byte padding"

cleanup:
        emms
        pop     ebp
        pop     edi
        pop     esi
        pop     ebx
        ret

section '.data' data readable writable
    base64lut:  file 'dec.bin'
    
Post 17 Feb 2008, 20:15
View user's profile Send private message Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
I'm wondering: is this just for fun, or do you have some real-world examples where BASE64 needs a lot of speed? Smile
Post 17 Feb 2008, 20:20
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22
MIME uses base64 to send attachments, so a fast base64 DEcoder could be optimal for a Virus scanner attached to a mail server.
Post 17 Feb 2008, 22:18
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 17715
Location: In your JS exploiting you and your system
revolution
One thing to note with the code is that it needs SSE2. You might want to consider detecting if the CPU is actually capable of SSE2 and drop to alternative code for older CPUs.
Post 17 Feb 2008, 22:27
View user's profile Send private message Visit poster's website Reply with quote
Nicodemus



Joined: 11 Feb 2008
Posts: 6
Nicodemus
Actually, I prefer choosing between different implementations (e.g. SSE vs. MMX vs. classic code) by doing a CPU detection once at application start and selecting from a set of CPU-feature-specific libraries just there.

This avoids the overhead of doing this in every function, which is a performance hit.

Nevertheless, you are right: Both my algorithms above require SSE2; the 2nd version of my decoder also requires a 16-byte-aligned input byte array).
Post 18 Feb 2008, 06:49
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.