flat assembler
Message board for the users of flat assembler.

Index > Windows > Need higher analysis of function

AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 16 Dec 2007, 16:52
My apologies for starting multiple threads, but I would like some more experienced opinions on my code. I am attempting the SubWord() function of the Rijndael cipher, and it was challenging to write in asm. There were other ways of doing it, but I chose to store the table in data and use an offset into that table to do the substitution. I would love for anyone to look at this code and tell me what I can do to make it better, or point out any problems you see. The function is supposed to mask off all but the desired nibbles on each pass of the outer loop; the inner loops then find the appropriate offset into the table. Here's the explanation, if you want to understand what the code is meant to do:
http://en.wikipedia.org/wiki/Rijndael_S-box
Thanks!
Code:

                ;===========================================
                ;==== Start SubWord()
                ;===========================================
                ;Substitute word function
                ;Takes a dword at a time, substitutes one byte at a time
                ;Loops, four bytes in eax are substituted -> eax
                ;preserves all regs except eax
                ;eax = row ebx = column ecx = i edx = nibble mask counter
                ;edx is calculated by shl 4 (bits) (0x00000001,0x00000010,etc.)
                SubWord:
                ;We are going to use ebx as column, ecx as i
                ;edx is used for the outer loop to mask off sequential nibbles
                push ebx
                push ecx
                push edx
                push ebp  ;for "another variable"
                push esi
                push edi

                ;xor ebx,ebx    ;ebx and edx are taken over anyway...
                xor ecx,ecx
                xor ebp,ebp
                xor esi,esi     ;esi is final holder, until end
                ;xor edx,edx
                mov edx,0x00000001  ;first outer iteration
                push eax  ;To preserve value until next loop, eax and ebx both masked
                ;Begin outer loop here!!
                ;=======================
                BeginOuterLoop:
                mov eax,[esp+0x4] ;original eax
                mov ebx,eax  ;get ebx ready to be masked
                ;eax contains the four bytes, ebx is a copy
                ;i is register ecx
                ;shift edx sequentially on every outer loop run to get
                ;      appropriate nibble
                ;Mask off here

                and ebx,edx ;we need this loop's column nibble
                shl edx,4
                and eax,edx ;we need this loop's row nibble
                shl edx,4 ;Get it ready for next outer loop
                ;Odd masking arith, but it needs to be done
                ;Explanation:
                ;       we needed the column nibble in ebx, row in eax
                ;       so, because the column nibble is the furthest right,
                ;       we had to mask it first, then shift the masking bit over
                ;       to get the row nibble (most significant nibble, left)
                FirstCompare:
                cmp eax,ecx ;compare row against i
                je AfterFirstInnerLoop
                inc ecx
                jmp FirstCompare

                AfterFirstInnerLoop:
                ;(i*16) + 1 -> ebp
                push eax
                push ebx
                push edx
                mov eax,ecx
                mov ebx,0x00000010 ;16
                mul ebx
                mov ebp,eax
                pop edx
                pop ebx
                pop eax
                inc ebp

                xor ecx,ecx ;reset i -> 0, row has been found, now columnloop

                SecondCompare:
                cmp ebx,ecx  ;compare column to i
                je AfterSecondInnerLoop
                inc ecx
                jmp SecondCompare

                AfterSecondInnerLoop:
                add ebp,ecx ;Add column offset (i) to "another variable"
                ;Now, do the byte change into eax
                ;For this, we will have a mem location for sbox/invsbox
                cmp [IsInv],0
                jne Inversed
                mov edi,[Sbox+ebp]     ;Store in temp reg
                or esi,edi ;put replaced byte into final holder(esi)
                cmp edx,0x10000000     ;Check if this is last iteration
                jne BeginOuterLoop
                jmp EndSubWord
                Inversed:
                mov edi,[InvSbox+ebp]
                or esi,edi ;puts replaced byte into final holder(esi)
                cmp edx,0x10000000
                jne BeginOuterLoop
                EndSubWord:
                ;Easy how one instruction does it all...
                mov eax,esi
                ;Put back regs
                add esp,0x4 ;Get rid of garbage eax from before outer loop
                pop edi
                pop esi
                pop ebp
                pop edx
                pop ecx
                pop ebx
                ret
                ;All registers except esp were used, all except eax restored
                ;===========================================
                ;=== End of SubWord()
                ;===========================================
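
The row/column lookup described in this post can be modeled in a few lines. The sketch below is in Python rather than fasm purely so it can be run anywhere. `sbox_offset` is a hypothetical helper name, and the sketch assumes the table is laid out as on the linked Wikipedia page: row by row, 256 consecutive bytes, no padding before the first entry.

```python
def sbox_offset(byte):
    """Return (row, column, flat offset) for one input byte.

    Assumption: the 16x16 table is stored row by row as 256
    consecutive bytes, starting at offset 0.
    """
    row = (byte >> 4) & 0x0F   # high nibble selects the row
    col = byte & 0x0F          # low nibble selects the column
    return row, col, row * 16 + col


# Check every possible byte: is the flat offset ever different
# from the input byte itself?
all_match = all(sbox_offset(b)[2] == b for b in range(256))
```

If that layout holds, the flat offset is always the input byte itself, so the byte can index the table directly without a search loop or a multiply.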
    
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20416
Location: In your JS exploiting you and your system
revolution 16 Dec 2007, 17:04
Code:
mov eax,[esp+0x4] ;original eax    
I think you might mean:
Code:
mov eax,[esp] ;original eax    


Code:
                push eax
                push ebx
                push edx
                mov eax,ecx
                mov ebx,0x00000010 ;16
                mul ebx
                mov ebp,eax
                pop edx
                pop ebx
                pop eax
                inc ebp    
This can be simplified
Code:
lea ebp,[ecx*8]
lea ebp,[ebp*2+1]    
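
A quick way to confirm that the two `lea` instructions match the `mul` sequence is to model both in Python. This is only a sketch with hypothetical function names, and it assumes 32-bit wraparound. Two instructions are needed because an x86 address can only scale an index register by 1, 2, 4 or 8.

```python
MASK32 = 0xFFFFFFFF


def mul_version(ecx):
    """Model of: mov eax,ecx / mov ebx,16 / mul ebx / mov ebp,eax / inc ebp."""
    eax = (ecx * 0x10) & MASK32      # low 32 bits of the product
    return (eax + 1) & MASK32        # inc ebp


def lea_version(ecx):
    """Model of: lea ebp,[ecx*8] / lea ebp,[ebp*2+1]."""
    ebp = (ecx * 8) & MASK32
    return (ebp * 2 + 1) & MASK32


# Compare the two for every row index a nibble can hold,
# plus a few large values to exercise the wraparound.
samples = list(range(16)) + [0x0FFFFFFF, 0x10000000, 0xFFFFFFFF]
same = all(mul_version(v) == lea_version(v) for v in samples)
```

Unlike `mul`, `lea` leaves `eax`, `edx` and the flags alone, which is why the three push/pop pairs can be dropped as well.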


Code:
cmp edx,0x10000000     ;Check if this is last iteration    
Seems you might never finish the loop, try:
Code:
cmp edx,0    ;Check if this is last iteration    
That's all for now. Just my 2 minute critique.
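
To see why the original compare can never match, the sketch below replays what `edx` holds each time the end-of-loop `cmp` runs. It is a Python model with a hypothetical helper name, and it assumes `edx` starts at 1 and wraps at 32 bits, as in the posted code.

```python
def edx_at_compare(start=0x00000001, max_passes=8):
    """List the value of edx seen by the end-of-loop cmp on each pass."""
    edx = start
    seen = []
    for _ in range(max_passes):
        edx = (edx << 4) & 0xFFFFFFFF   # shl edx,4 (column mask -> row mask)
        edx = (edx << 4) & 0xFFFFFFFF   # shl edx,4 (ready for the next pass)
        seen.append(edx)
        if edx == 0:                    # the bit has been shifted out
            break
    return seen


values = edx_at_compare()
```

Under those assumptions `edx` moves by a whole byte per pass, so it steps over 0x10000000 and reaches 0 after the fourth pass, which is the value the suggested `cmp edx,0` tests for.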


revolution 16 Dec 2007, 17:16
Here's a present for you: a DLL I made many years ago. There are no comments or documentation, so you'll have to work out how to use it on your own. I'm not even sure if it will still assemble on the latest fasm version. It is small because it generates all the tables on the fly at startup.
Code:
;************************************************************************************************
;constants
;************************************************************************************************
min_key            = 4
max_key          = 8
min_blocks       = 4
max_blocks       = 8

;do_odd_blocks       =1
;do_odd_keys      =1

format PE GUI 4.0 DLL
entry RD_init
include 'win32a.inc'
b equ byte
d equ dword

;************************************************************************************************
;constants
;************************************************************************************************

rounds_off   = 6

max_rounds   = max_key+rounds_off
if max_blocks > max_rounds
  max_rounds   = max_blocks+rounds_off
end if

if defined do_odd_keys
  total_keys = max_key-min_key+1
else
  total_keys     = (max_key-min_key)/2+1
end if

if defined do_odd_blocks
  total_blocks     = max_blocks-min_blocks+1
else
  total_blocks     = (max_blocks-min_blocks)/2+1
end if

;************************************************************************************************
;structures
;************************************************************************************************
struct RD_STATE
     working_b       rd      max_blocks
  align 16
    rk              rd      max_blocks*(max_rounds+1)
   align 16
    rk_d            rd      max_blocks*(max_rounds+1)
   block_size      dd      ?
   key_size        dd      ?
   rounds          dd      ?
   encrypt_func    dd      ?
   decrypt_func    dd      ?
   clear_size      =       $
   handle          dd      ?
ends

struct RD_TABLES
   tab_s   rb      256
 tab_si  rb      256
 t1      rd      256
 t2      rd      256
 t3      rd      256
 t4      rd      256
 t5      rd      256
 t6      rd      256
 t7      rd      256
 t8      rd      256
 u1      rd      256
 u2      rd      256
 u3      rd      256
 u4      rd      256
 rcon    rd      30
ends

;************************************************************************************************
;global data
section '.udata' data readable writeable
;************************************************************************************************

align 16
RD_tables RD_TABLES

;************************************************************************************************
;macros
;************************************************************************************************
macro do_possible3a func,blk,rnd1,rnd2,rnd3,rnd4,rnd5
{
   do_next=0
   if min_blocks<=blk & max_blocks>=blk
       if (min_key<=4)
         func blk,4,rnd1
             if rnd1=0
           do_next=1
           end if
       end if
      if (min_key<=5 & max_key>=5 & defined do_odd_keys) | (do_next)
               func blk,5,rnd2
             do_next=0
           if rnd2=0
           do_next=1
           end if
       end if
      if (min_key<=6 & max_key>=6) | (do_next)
         func blk,6,rnd3
             do_next=0
           if rnd3=0
           do_next=1
           end if
       end if
      if (min_key<=7 & max_key>=7 & defined do_odd_keys) | (do_next)
               func blk,7,rnd4
             do_next=0
           if rnd4=0
           do_next=1
           end if
       end if
      if (max_key>=8) | (do_next)
             func blk,8,rnd5
      end if
     end if
}

macro do_possibles3 func
{
  do_possible3a func,4,10,11,12,13,14
 if defined do_odd_blocks
     do_possible3a func,5,11,11,12,13,14
        end if
      do_possible3a func,6,12,12,12,13,14
 if defined do_odd_blocks
     do_possible3a func,7,13,13,13,13,14
        end if
      do_possible3a func,8,14,14,14,14,14
}

macro do_possibles2 func
{
     do_possible3a func,4,10,11,12,13,14
 if defined do_odd_blocks
     do_possible3a func,5,00,11,12,13,14
        end if
      do_possible3a func,6,00,00,12,13,14
 if defined do_odd_blocks
     do_possible3a func,7,00,00,00,13,14
        end if
      do_possible3a func,8,00,00,00,00,14
}

macro define_shifts blks
{
     sft1=1
      sft2=2  
    sft3=3
      if blks > 6
       sft3=4
     end if
      if blks > 7
       sft2=3
     end if
}

;************************************************************************************************
section '.code' code readable executable
;************************************************************************************************

GF_modulus  =01bh
GF_affine      =01fh
GF_magic       =063h

proc RD_init,hinstDLL,fdwReason,lpvReserved
    enter
       push    ebx esi edi ebp
;************************************************************************************************
;the log and alog tables are used to make the S box. Afterwards they are overwritten by t2
;************************************************************************************************
alog_table   =RD_tables.t2
log_table      =RD_tables.t2+256
;************************************************************************************************
;make alog and log tables
;************************************************************************************************
    mov     eax,1           ;the first value is 1
       mov     b[alog_table],al
    mov     b[log_table],0  ;actually log(0) is not used
        mov     ecx,eax
.alog:
;multiply AL by 3 in GF(2^8)
       mov     bh,al
       add     bh,bh
       sbb     bl,bl
       and     bl,GF_modulus
       xor     bh,bl
       xor     al,bh
       mov     b[ecx+alog_table],al
        mov     b[eax+log_table],cl
 add     cl,1
        jnc     .alog
       mov     b[eax+log_table],cl     ;set the last value mod 255
;************************************************************************************************
;make S box and inverse S box
;************************************************************************************************
      mov     edx,255
     mov     ecx,GF_magic
;the multiplicative inverse of 0 needs special handling (log and alog can't deal with 0)
   mov     b[000+RD_tables.tab_s],cl
   mov     b[ecx+RD_tables.tab_si],0
.sbox: movzx   eax,b[edx+log_table]
        not     al
  mov     al,b[eax+alog_table]    ;AL=multiplicative inverse of DL
    mov     bl,GF_affine
        mov     cl,GF_magic
.sbox2:      shr     al,1
        sbb     bh,bh
       and     bh,bl
       xor     cl,bh
       rol     bl,1
        test    al,al
       jnz     .sbox2
      mov     [edx+RD_tables.tab_s],cl
    mov     [ecx+RD_tables.tab_si],dl
   sub     edx,1
       jnz     .sbox
;************************************************************************************************
;make t5, t6, t7, t8, u1, u2, u3 & u4
;************************************************************************************************
        xor     esi,esi
     xor     edx,edx
     mov     cl,b[log_table+0eh]
 mov     ch,b[log_table+09h]
 mov     bl,b[log_table+0dh]
 mov     bh,b[log_table+0bh]
.ti: movzx   eax,b[esi+RD_tables.tab_si]
 lea     edi,[eax*4]
 test    eax,eax
     movzx   ebp,al
      jz      .ti2
        mov     al,b[eax+log_table]
 mov     dl,al
       add     dl,cl
       adc     dl,0
        mov     dl,b[edx+alog_table]
        shrd    ebp,edx,8
   mov     dl,al
       add     dl,ch
       adc     dl,0
        mov     dl,b[edx+alog_table]
        shrd    ebp,edx,8
   mov     dl,al
       add     dl,bl
       adc     dl,0
        mov     dl,b[edx+alog_table]
        shrd    ebp,edx,8
   add     al,bh
       adc     al,0
        mov     al,b[eax+alog_table]
        shrd    ebp,eax,8
.ti2:  mov     [esi*4+RD_tables.t5],ebp
    mov     [edi+RD_tables.u1],ebp
      rol     ebp,8
       mov     [esi*4+RD_tables.t6],ebp
    mov     [edi+RD_tables.u2],ebp
      rol     ebp,8
       mov     [esi*4+RD_tables.t7],ebp
    mov     [edi+RD_tables.u3],ebp
      rol     ebp,8
       mov     [esi*4+RD_tables.t8],ebp
    mov     [edi+RD_tables.u4],ebp
      add     esi,1
       test    esi,0ffh
    jnz     .ti
;************************************************************************************************
;make t1, t2, t3 & t4
;************************************************************************************************
  xor     edx,edx
.t:      movzx   eax,b[edx+RD_tables.tab_s]
;the log and alog tables will be destroyed so the multiplications are done directly
;multiply AL by 2 in GF(2^8)
   mov     ah,al
       add     ah,ah
       sbb     bl,bl
       and     bl,GF_modulus
       xor     ah,bl   ;ah=al*2
;multiply AL by 3 in GF(2^8)
    mov     ch,al
       xor     ch,ah   ;ch=al*3
;make the value
 mov     cl,al
       shl     ecx,16
      mov     ch,al
       mov     cl,ah
       mov     [edx*4+RD_tables.t1],ecx
    rol     ecx,8
       mov     [edx*4+RD_tables.t2],ecx
    rol     ecx,8
       mov     [edx*4+RD_tables.t3],ecx
    rol     ecx,8
       mov     [edx*4+RD_tables.t4],ecx
    add     dl,1
        jnc     .t
;************************************************************************************************
;make rcon
;************************************************************************************************
  xor     edx,edx
     mov     eax,1
.rcon:     mov     [edx*4+RD_tables.rcon],eax
;multiply AL by 2 in GF(2^8)
  add     al,al
       sbb     bl,bl
       and     bl,GF_modulus
       xor     al,bl   ;al=al*2
    add     dl,1
        cmp     dl,30
       jb      .rcon
       pop     ebp edi esi ebx
     mov     eax,TRUE
    return
endp

proc RD_clear_state,state
     enter
       push    ebx edi
     mov     eax,-1
      mov     ebx,[state]
 mov     edx,RD_STATE.clear_size
     shr     edx,2
       mov     edi,ebx
     mov     ecx,edx
     rep     stosd
       mov     eax,0aaaaaaaah
      mov     edi,ebx
     mov     ecx,edx
     rep     stosd
       not     eax     ;055555555h
 mov     edi,ebx
     mov     ecx,edx
     rep     stosd
       xor     eax,eax
     mov     edi,ebx
     mov     ecx,edx
     rep     stosd
       pop     edi ebx
     return
endp

proc RD_free_state,state
      enter
       stdcall RD_clear_state,[state]
      mov     eax,[state]
 invoke  VirtualUnlock,[eax+RD_STATE.handle],sizeof.RD_STATE+16
      test    eax,eax
     jz      .done
       mov     eax,[state]
 invoke  VirtualFree,[eax+RD_STATE.handle],sizeof.RD_STATE+16,MEM_DECOMMIT+MEM_RELEASE
.done:     return
endp

proc RD_new_state,block_size,key_size
 enter
       xor     eax,eax
     mov     ecx,[block_size]
    sub     ecx,min_blocks
    if total_blocks>1
  jb      .done
       cmp     ecx,max_blocks-min_blocks
   ja      .done
      if ~defined do_odd_blocks
    shr     ecx,1
       jc      .done
      end if
    else
   jnz     .done
    end if
 mov     edx,[key_size]
      sub     edx,min_key
    if total_keys>1
       jb      .done
       cmp     edx,max_key-min_key
 ja      .done
      if ~defined do_odd_keys
      shr     edx,1
       jc      .done
      end if
    else
   jnz     .done
    end if
    if total_keys=2
  lea     edx,[edx+ecx*2]
    else if total_keys=3
 lea     ecx,[ecx*2+ecx]
     add     edx,ecx
    else if total_keys=4
 lea     edx,[edx+ecx*4]
    else if total_keys=5
 lea     ecx,[ecx*4+ecx]
     add     edx,ecx
    end if
       push    edx
 invoke  VirtualAlloc,NULL,sizeof.RD_STATE+16,MEM_COMMIT,PAGE_READWRITE
      test    eax,eax
     jz      .done
       push    eax
;lock the memory so it doesn't get put into the paging file
 invoke  VirtualLock,eax,sizeof.RD_STATE+16
  test    eax,eax
     pop     eax
 pop     edx
 jz      .release
    lea     ecx,[eax+0fh]
       and     ecx,not 0fh
 mov     [ecx+RD_STATE.handle],eax
   mov     eax,[edx*8+RD_jump_table+0]
 mov     edx,[edx*8+RD_jump_table+4]
 mov     [ecx+RD_STATE.encrypt_func],eax
     mov     [ecx+RD_STATE.decrypt_func],edx
     mov     eax,[block_size]
    mov     edx,[key_size]
      mov     [ecx+RD_STATE.block_size],eax
       mov     [ecx+RD_STATE.key_size],edx
 cmp     eax,edx
     lea     eax,[eax+rounds_off]
        ja      .a
  lea     eax,[edx+rounds_off]
.a: mov     [ecx+RD_STATE.rounds],eax
   mov     eax,ecx
.done:   return
.release:
     invoke  VirtualFree,eax,sizeof.RD_STATE+16,MEM_DECOMMIT+MEM_RELEASE
 xor     eax,eax
     return
endp

proc RD_expand_key,state,input_key
.loop_stop  dd      ?
.loop_rcon dd      ?
   enter
       push    ebx esi edi ebp
     mov     esi,[input_key]
     mov     ebp,[state]
 lea     edi,[ebp+RD_STATE.rk]
       mov     ecx,[ebp+RD_STATE.key_size]
 rep     movsd
;eax=scratch
;ebx=secondary compare for (loop counter mod key_size)
;ecx=loop counter
;edx=loop counter mod key_size
;esi=-key_size*4
;edi=offset to current rk
;ebp=STATE
;make the encryption round keys
    mov     eax,[ebp+RD_STATE.rounds]
   add     eax,1
       imul    eax,[ebp+RD_STATE.block_size]
       sub     eax,[ebp+RD_STATE.key_size]
 mov     [.loop_stop],eax
    xor     ecx,ecx
     xor     ebx,ebx
     cmp     [ebp+RD_STATE.key_size],6
   setge   bl
  shl     ebx,2
       xor     edx,edx
     mov     [.loop_rcon],RD_tables.rcon
 mov     esi,[ebp+RD_STATE.key_size]
 shl     esi,2
       neg     esi
.a:  mov     eax,[edi-4]
 test    edx,edx
     jnz     .b
  movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    ror     eax,8
       movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    ror     eax,8
       movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    ror     eax,8
       movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    ror     eax,16          ;rotbyte done here
  mov     edx,[.loop_rcon]
    xor     eax,[edx]       ;rcon
       xor     edx,edx         ;reset the counter
  jmp     .c
.b:   cmp     edx,ebx
     jnz     .c
  movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    rol     eax,8
       movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    rol     eax,8
       movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    rol     eax,8
       movzx   edx,al
      mov     al,[edx+RD_tables.tab_s]
    rol     eax,8
       mov     edx,ebx         ;reset the counter
.c:   xor     eax,[edi+esi]
       mov     [edi],eax
   add     edi,4
       add     edx,1
       cmp     edx,[ebp+RD_STATE.key_size]
 jb      .d
  xor     edx,edx
     add     [.loop_rcon],4
.d:       add     ecx,1
       cmp     ecx,[.loop_stop]
    jb      .a
;make the decryption round keys
       lea     esi,[ebp+RD_STATE.rk]
       lea     edi,[ebp+RD_STATE.rk_d]
     mov     ecx,[ebp+RD_STATE.key_size]
 rep     movsd
       mov     ebp,[.loop_stop]
    sub     edi,esi
.e:      movzx   eax,b[esi]
  movzx   ebx,b[esi+1]
        movzx   ecx,b[esi+2]
        movzx   edx,b[esi+3]
        mov     eax,[eax*4+RD_tables.u1]
    xor     eax,[ebx*4+RD_tables.u2]
    xor     eax,[ecx*4+RD_tables.u3]
    xor     eax,[edx*4+RD_tables.u4]
    mov     [esi+edi],eax
       add     esi,4
       sub     ebp,1
       jnz     .e
  pop     ebp edi esi ebx
     return
endp

macro copy_data blks,source,dest
{
    repeat (blks+3)/4
   mov     eax,[source+(%-1)*16]
       if (blks-(%-1)*4) >= 2
   mov     ebx,[source+(%-1)*16+4]
     end if
      if (blks-(%-1)*4) >= 3
   mov     ecx,[source+(%-1)*16+8]
     end if
      if (blks-(%-1)*4) >= 4
   mov     edx,[source+(%-1)*16+12]
    end if
      mov     [dest+(%-1)*16],eax
 if (blks-(%-1)*4) >= 2
   mov     [dest+(%-1)*16+4],ebx
       end if
      if (blks-(%-1)*4) >= 3
   mov     [dest+(%-1)*16+8],ecx
       end if
      if (blks-(%-1)*4) >= 4
   mov     [dest+(%-1)*16+12],edx
      end if
      end repeat
}

macro keyaddition blks
{
;esi=a
;edi=rk
   repeat (blks+3)/4
   mov     eax,[edi+(%-1)*16]
  if (blks-(%-1)*4) >= 2
   mov     ebx,[edi+(%-1)*16+4]
        end if
      if (blks-(%-1)*4) >= 3
   mov     ecx,[edi+(%-1)*16+8]
        end if
      if (blks-(%-1)*4) >= 4
   mov     edx,[edi+(%-1)*16+12]
       end if
      xor     [esi+(%-1)*16],eax
  if (blks-(%-1)*4) >= 2
   xor     [esi+(%-1)*16+4],ebx
        end if
      if (blks-(%-1)*4) >= 3
   xor     [esi+(%-1)*16+8],ecx
        end if
      if (blks-(%-1)*4) >= 4
   xor     [esi+(%-1)*16+12],edx
       end if
      end repeat
}

macro tablelookup_key_add blks,source,dest,key_off
{
    repeat blks
 movzx   eax,b[source+((0000+%-1) mod blks)*4+0]
     movzx   ebx,b[source+((sft1+%-1) mod blks)*4+1]
     movzx   ecx,b[source+((sft2+%-1) mod blks)*4+2]
     mov     eax,[eax*4+RD_tables.t1]
    xor     eax,[ebx*4+RD_tables.t2]
    movzx   ebx,b[source+((sft3+%-1) mod blks)*4+3]
     xor     eax,[ecx*4+RD_tables.t3]
    xor     eax,[edi+(%-1)*4+key_off]
   xor     eax,[ebx*4+RD_tables.t4]
    mov     [dest+(%-1)*4],eax
  end repeat
}

macro tablelookupf_key_add blks,source,dest,key_off
{
   repeat blks
 movzx   eax,b[source+((sft2+%-1) mod blks)*4+2]
     movzx   ebx,b[source+((sft3+%-1) mod blks)*4+3]
     movzx   ecx,b[source+((0000+%-1) mod blks)*4+0]
     movzx   edx,b[source+((sft1+%-1) mod blks)*4+1]
     mov     al,b[eax+RD_tables.tab_s]
   mov     ah,b[ebx+RD_tables.tab_s]
   shl     eax,16
      mov     al,b[ecx+RD_tables.tab_s]
   mov     ah,b[edx+RD_tables.tab_s]
   xor     eax,[edi+(%-1)*4+key_off]
   mov     [dest+(%-1)*4],eax
  end repeat
}

macro encrypt_block blks,keys,rounds
{
align 16
encrypt_block_#blks#_#keys#:
if rounds>0
        define_shifts blks
  keyaddition blks
    mov     edx,(rounds-1)
      add     edi,blks*4
.a:   tablelookup_key_add blks,esi,ebp,0
  xchg    esi,ebp
     add     edi,blks*4
  sub     edx,1
       jnz     .a
  tablelookupf_key_add blks,esi,ebp,0
 if (rounds and 1)
     copy_data blks,ebp,esi
    end if
      ret
end if
}
do_possibles2 encrypt_block

macro tablelookup2_key_add blks,source,dest,key_off
{
 repeat blks
 movzx   eax,b[source+((000000000+%-1) mod blks)*4+0]
        movzx   ebx,b[source+((blks-sft1+%-1) mod blks)*4+1]
        movzx   ecx,b[source+((blks-sft2+%-1) mod blks)*4+2]
        mov     eax,[eax*4+RD_tables.t5]
    xor     eax,[ebx*4+RD_tables.t6]
    movzx   ebx,b[source+((blks-sft3+%-1) mod blks)*4+3]
        xor     eax,[ecx*4+RD_tables.t7]
    xor     eax,[edi+(%-1)*4-key_off]
   xor     eax,[ebx*4+RD_tables.t8]
    mov     [dest+(%-1)*4],eax
  end repeat
}

macro tablelookup2f_key_add blks,source,dest,key_off
{
  repeat blks
 movzx   eax,b[source+((blks-sft2+%-1) mod blks)*4+2]
        movzx   ebx,b[source+((blks-sft3+%-1) mod blks)*4+3]
        movzx   ecx,b[source+((000000000+%-1) mod blks)*4+0]
        movzx   edx,b[source+((blks-sft1+%-1) mod blks)*4+1]
        mov     al,[eax+RD_tables.tab_si]
   mov     ah,[ebx+RD_tables.tab_si]
   shl     eax,16
      mov     al,[ecx+RD_tables.tab_si]
   mov     ah,[edx+RD_tables.tab_si]
   xor     eax,[edi+(%-1)*4-key_off]
   mov     [dest+(%-1)*4],eax
  end repeat
}

macro decrypt_block blks,keys,rounds
{
align 16
decrypt_block_#blks#_#keys#:
if rounds>0
        define_shifts blks
  add     edi,rounds*blks*4
   keyaddition blks
    mov     edx,(rounds-1)
      sub     edi,blks*4
.a:   tablelookup2_key_add blks,esi,ebp,0
 xchg    esi,ebp
     sub     edi,blks*4
  sub     edx,1
       jnz     .a
  tablelookup2f_key_add blks,esi,ebp,0
        if (rounds and 1)
     copy_data blks,ebp,esi
    end if
      ret
end if
}
do_possibles2 decrypt_block

macro jump_table blks,keys,rounds
{
   dd      encrypt_block_#blks#_#keys,decrypt_block_#blks#_#keys
}

align 8
RD_jump_table:
do_possibles3 jump_table

proc RD_encrypt_block,state,block_pointer
   enter
       push    ebx esi edi ebp
     mov     eax,[state]
 mov     esi,[block_pointer]
 lea     edi,[eax+RD_STATE.rk]
       lea     ebp,[eax+RD_STATE.working_b]
        call    [eax+RD_STATE.encrypt_func]
 pop     ebp edi esi ebx
     return
endp

proc RD_encrypt_blocks,state,block_pointer,blocks
     enter
       push    ebx esi edi
 mov     eax,[state]
 mov     esi,[block_pointer]
.a:     push    ebp                     ;save the frame pointer before ebp is reused
        lea     edi,[eax+RD_STATE.rk]
        lea     ebp,[eax+RD_STATE.working_b]
 call    [eax+RD_STATE.encrypt_func]
 pop     ebp
 mov     eax,[state]
 mov     esi,[block_pointer]
 mov     ecx,[blocks]
        mov     edx,[eax+RD_STATE.block_size]
       lea     esi,[esi+edx*4]
     mov     [block_pointer],esi
 sub     ecx,1
       mov     [blocks],ecx
        jnz     .a
  pop     edi esi ebx
 return
endp

proc RD_decrypt_block,state,block_pointer
     enter
       push    ebx esi edi ebp
     mov     eax,[state]
 mov     esi,[block_pointer]
 lea     edi,[eax+RD_STATE.rk_d]
     lea     ebp,[eax+RD_STATE.working_b]
        call    [eax+RD_STATE.decrypt_func]
 pop     ebp edi esi ebx
     return
endp

proc RD_decrypt_blocks,state,block_pointer,blocks
     enter
       push    ebx esi edi
 mov     eax,[state]
 mov     esi,[block_pointer]
.a:     push    ebp                     ;save the frame pointer before ebp is reused
        lea     edi,[eax+RD_STATE.rk_d] ;decryption keys, as in RD_decrypt_block
        lea     ebp,[eax+RD_STATE.working_b]
 call    [eax+RD_STATE.decrypt_func]
 pop     ebp
 mov     eax,[state]
 mov     esi,[block_pointer]
 mov     ecx,[blocks]
        mov     edx,[eax+RD_STATE.block_size]
       lea     esi,[esi+edx*4]
     mov     [block_pointer],esi
 sub     ecx,1
       mov     [blocks],ecx
        jnz     .a
  pop     edi esi ebx
 return
endp

section '.idata' import data readable writeable

library kernel,'KERNEL32.DLL'

import kernel,\
      VirtualAlloc,'VirtualAlloc',\
    VirtualFree,'VirtualFree',\
      VirtualLock,'VirtualLock',\
      VirtualUnlock,'VirtualUnlock'

section '.edata' export data readable

export 'RIJNDAEL.DLL',\
    RD_decrypt_block,'RijndaelDecryptBlock',\
        RD_decrypt_blocks,'RijndaelDecryptBlocks',\
      RD_encrypt_block,'RijndaelEncryptBlock',\
        RD_encrypt_blocks,'RijndaelEncryptBlocks',\
      RD_expand_key,'RijndaelExpandKey',\
      RD_free_state,'RijndaelFree',\
   RD_new_state,'RijndaelOpen',\
    RD_clear_state,'RijndaelWash'

section '.reloc' fixups data discardable    
If you find it useful let me know.

PS. you may need to alter the structure defs, because I used a custom structure macro back then when I wrote it.
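
For readers wondering how the tables can be generated "on the fly", here is a rough Python model of the approach the `.alog`, `.sbox` and `.sbox2` loops in the listing appear to take. Powers of 3 give log and antilog tables, those give multiplicative inverses, and an affine step turns each inverse into an S-box entry. The function names are made up for this sketch. The three constants are the ones from the listing. It is a model of the idea, not a line-by-line translation.

```python
GF_MODULUS = 0x1B   # low byte of the field polynomial x^8+x^4+x^3+x+1
GF_AFFINE = 0x1F
GF_MAGIC = 0x63


def xtime(a):
    """Multiply a by 2 in GF(2^8)."""
    a <<= 1
    if a & 0x100:
        a ^= 0x100 | GF_MODULUS
    return a


def rol8(value, count):
    """Rotate an 8-bit value left by count (0..7) bits."""
    return ((value << count) | (value >> (8 - count))) & 0xFF


def build_sboxes():
    # alog[i] = 3^i and log[3^i] = i; 3 generates every non-zero element.
    alog = [0] * 255
    log = [0] * 256
    value = 1
    for exponent in range(255):
        alog[exponent] = value
        log[value] = exponent
        value ^= xtime(value)           # value * 3 = value ^ (value * 2)

    sbox = [0] * 256
    inv_sbox = [0] * 256
    for x in range(256):
        # 0 has no inverse and is treated as 0, as the listing's comment notes.
        inverse = alog[(255 - log[x]) % 255] if x else 0
        # Affine step: every set bit of the inverse XORs in a rotated 0x1F.
        result = GF_MAGIC
        for bit in range(8):
            if (inverse >> bit) & 1:
                result ^= rol8(GF_AFFINE, bit)
        sbox[x] = result
        inv_sbox[result] = x
    return sbox, inv_sbox


SBOX, INV_SBOX = build_sboxes()
```

The first entries can be checked against the published S-box (0x00 maps to 0x63, 0x01 to 0x7C), which is a cheap sanity test for any generated table.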



AlexP 16 Dec 2007, 18:13
Thanks for checking it out. I will take a look at that code. I just located another asm implementation, and it is a totally linear encrypt/decrypt in 16-bit mode. Very small, but I saw many odd instructions which turned out to depend on a lot of stuff in other files, more C code, etc., so I didn't look any further. I was thinking there was another way to do the multiplication you fixed above, thanks.

Okay, here's my next "challenge". Please criticize and ridicule to no end so I can make it more optimized! I have tried it, and it is working for now. It uses a substitution table (SBox): for each byte in eax, it splits the byte into two nibbles and substitutes the matching table entry. Please refer to the URL in my earlier post for an example. It is the simplest thing to do on paper, but a pain in asm. The point is to use the first (high) nibble as the row of the table and the second (low) nibble as the column, and replace the byte with the table entry at that location. Have fun!!

Code:
                ;===========================================
                ;==== Start SubWord()
                ;============================================
                ;Substitute word function
                ;Takes a dword at a time, substitutes one byte at a time
                ;Loops, four bytes in eax are substituted -> eax
                ;preserves all regs except eax
                ;IsInv data byte is set before function, determines which table to use
                ;eax = current row nibble 
                ;ebx = current column nibble
                ;ecx (cl) = mask counter
                ;edx = nibble mask
                ;ebp = original eax passed to function

                SubWord:

                push ecx
                push edx
                push edi
                push esi
                push ebx
                push ebp

                mov ebp,eax           ;Used on every iteration
                mov edx,0x0000000F ;Initial mask setting
                xor ecx,ecx
                xor esi,esi

                BeginOuterLoop:
                mov eax,ebp      ;Original eax
                mov ebx,ebp      ;Original eax
                and ebx,edx      ;ebx -> column mask
                shl edx,4        ;Increment masker 1 position
                shr ebx,cl       ;Obtain value from mask -> ebx

                and eax,edx      ;eax -> row mask
                shl edx,4        ;Increment masker 1 position
                add cl,4
                shr eax,cl       ;Obtain value from mask -> eax
                add cl,4

                ;We need to multiply row pointer by 16 to obtain row offset
                ;(because there's 16 bytes in every row)
                lea eax,[eax*8]
                lea eax,[eax*2]

                ;Now calculate the substitution
                add eax,ebx ;Add the row offset to the column offset
                cmp [IsInv],0
                jne Inversed
                movzx edi,BYTE [SBox + eax]    ;Load one table byte, zero-extended
                jmp AfterInv
                Inversed:
                movzx edi,BYTE [InvSBox + eax] ;Byte load avoids reading past the table end
                AfterInv:
                cmp cl,8
                je First
                sub cl,8
                shl edi,cl
                add cl,8
                First:
                or esi,edi ;Place replaced bits in esi
                cmp edx,0
                jne BeginOuterLoop

                ;Finishing sequence
                mov eax,esi
                pop ebp
                pop ebx
                pop esi
                pop edi
                pop edx
                pop ecx
                ret
                ;===========================================
                ;=== End of SubWord()
                ;===========================================
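A side note on the indexing above, offered as a hedged sketch rather than a drop-in replacement: since every row holds 16 bytes, `row*16 + column` is simply the original byte value, so the nibble split can be skipped and each byte used directly as the table index. The Python below only models that equivalence. The function names and the stand-in table are illustrative and not part of the posted code.

```python
def sub_word_nibbles(word, table):
    """Mirror the posted routine: split each byte into row/column nibbles."""
    out = 0
    for shift in (0, 8, 16, 24):
        column = (word >> shift) & 0xF
        row = (word >> (shift + 4)) & 0xF
        out |= table[row * 16 + column] << shift
    return out


def sub_word_direct(word, table):
    """Same result, indexing the 256-entry table with the whole byte."""
    out = 0
    for shift in (0, 8, 16, 24):
        out |= table[(word >> shift) & 0xFF] << shift
    return out


# Stand-in 256-entry table (NOT the real S-box), only used to compare both paths.
demo_table = [(i * 7 + 3) & 0xFF for i in range(256)]
word = 0xCF4F3C09
assert sub_word_nibbles(word, demo_table) == sub_word_direct(word, demo_table)
```

In asm terms this would presumably come down to a zero-extended byte load used as the index, four times per dword, with no masks or shifts for the nibbles.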
    


WOW!!! I just finished a full day of coding (I'm only an early teen, don't make fun of me) and spent about 4 hours debugging and optimising this code. Not the one above: I finally got a Rijndael key expansion (128, 192, and 256-bit) program going. I ran it through multiple "testing" keys and all the results matched!!! I feel so good that I get to play with it tomorrow and learn how to use scanf with hex characters... shall be fun!! Thanks to anyone who bothered to help me!!!! :)
Post 16 Dec 2007, 18:13
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 20 Dec 2007, 02:19
Is anyone still there? I was just wondering whether the % modulus symbol in fasm turns into a lot of code; I see it in the dll above.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20416
Location: In your JS exploiting you and your system
revolution 20 Dec 2007, 04:01
% gives the current repeat count, mod gives a modulus. This is not C.
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 20 Dec 2007, 22:21
Ohh yea, I forgot about that in the FASM manual. I've been poring over asm and C implementations of Rijndael, and it is driving me nuts trying to find an optimized MixColumns, or InvMixColumns for that matter. So, I'll quickly check whether the "mod" actually expands to a div instruction in the code. Thanks-
PS: if anyone can find a MixColumns in asm, please show me the site. The Rijndael dll above is nice, and probably works, but I am doing this just for education purposes and want to keep it in "function form"... aka not inlined.

NEW QUESTION: Does the entry point in a DLL get executed whenever a function in the dll is called? How does that work? Would this work?

Code:

format PE GUI 4.0 DLL
entry Init

;Later
Init:
       push ebp
       mov ebp,esp
       pusha
       ret

;All my functions already have above code done for them?

    
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20416
Location: In your JS exploiting you and your system
revolution 21 Dec 2007, 00:44
The code I posted above does all the mix columns and other things.
Code:
        movzx   eax,b[source+((sft2+%-1) mod blks)*4+2]    
It just does it an optimised way which is what you are looking for.
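To make the `%`/`mod` point concrete: inside a `repeat` block, fasm evaluates `((sft2+%-1) mod blks)*4+2` while assembling, so each repetition ends up with a plain constant displacement and no `div` is emitted. The Python below only models that arithmetic. The values of `sft2` and `blks` are assumptions for illustration, since they are not shown in this thread.

```python
def repeat_offsets(sft2, blks):
    """Model the constant displacements computed for each repetition.

    In fasm, % is the current repetition number (starting at 1) and
    mod is the remainder operator; both are resolved at assembly time.
    """
    return [((sft2 + count - 1) % blks) * 4 + 2 for count in range(1, blks + 1)]


# Assumed values: 4 dwords per block and a shift of 2 columns.
print(repeat_offsets(sft2=2, blks=4))  # prints [10, 14, 2, 6]
```

Each number in the list would end up as the fixed offset of one `movzx` in the unrolled code.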

Also referring to my code above, the DLL entry is called whenever it is mapped into, or out of, an address space. The reason for calling is passed in fdwReason. Well behaved code should check this parameter and act accordingly. Notice that my code is not well behaved because I completely ignore the value in fdwReason and just continue to compute the tables.
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 21 Dec 2007, 01:26
Okay... So the entry is called once every time the dll is loaded? So the dll executes this code on startup, and by the time the first function is called from the dll that code has already run? -Getting confused here-... I will look at those pieces of code; my implementation of MixColumns takes up dozens of lines of code and is very confusing. It appears that you use some lookup tables called T1, T2, etc. What are those tables used for? I have read about log and alog tables for multiplication, but what are those? Could you describe that code's functionality a little more? It looks like the section containing that code does a 'repeat blks' directive and works on it a dword at a time. I use a very long, complex xoring operation that gives the results the documentation calls for. I still have to debug the entire cipher/decipher, but it is put together. I'll have a lot to do, but thankfully we have off of school all next week. I'll get it put together and try using some of your techniques if I can understand them :) I'll post more questions about it here when I have them. Thanks again
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 21 Dec 2007, 05:03
AlexP wrote:
Would this work?

Code:
       push ebp
       mov ebp,esp
       pusha
       ret    
This code is strange. The flags are put on the stack and then RET uses the flag value as the return address. Oooh, scary - I can't see how this would be the intended function.

I don't know much about DLLs, but what I've read seems to indicate that the entry is called only after the DLL is mapped into a process's address space. The process that uses the DLL will call the functions within the DLL directly - Init will have taken place prior to any possible use of DLL functions by that process.

So, if I understand correctly, the code you suggest will not be required. The DLL Init is used to establish the DLL within the process - kind of global one time things that the other functions might need. Not the local function specific stuff your code suggests.


Last edited by bitRAKE on 21 Dec 2007, 06:03; edited 1 time in total
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 21 Dec 2007, 05:18
Quote:

The flags are put on the stack and then RET

Actually GPR registers, it is pusha, not pushf ;)

I don't understand how this could be intended behavior either :?
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 21 Dec 2007, 22:48
Sry- I've got it now. I was just confused because I didn't notice that was only about 3 lines of his, not several dozen. I've decided on having 4 sections of code: .Keys, .Tables, .Rijndael, and .Exports. Everything has been figured out as to how the entry point in a dll is called, thanks for the help. I'm still looking for a fast algo for calculating the tables at runtime; I may do that or just put the tables in the .Tables section. I noticed that Revolution claimed his dll was smaller, but when I looked he reserved the same amount of memory as he would have used if the tables were already there. This led me to another kind of dumb question: So if you have RD 256, the program reserves that much at runtime, compared to having a 256-dword data table, right? I've never had to create libraries much before, so I believe it is a stupid question. I guess that the actual dll file is smaller when it reserves the data and calculates it at runtime, compared to having all the kilos of tables packed into the dll file. I will most likely start off with just having the tables hard-coded and then later calculate them when I can get the routines working. Any suggestions or comments?

PS: I used a push ebp, etc. and then pusha because I usually do that for debug. I just used ebp to get the parameters, then pushed all the GPRs just to make debugging easier. Good analysis though
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 21 Dec 2007, 23:06
Generating values at runtime costs additional space for code - sometimes this is worse than just storing the initial data. If coding for a small memory footprint then both code and data size must be considered. Also, it's possible to save additional memory with readonly data in a DLL because when multiple processes are using the DLL the data only has to be mapped into each process - whereas changing memory resides in each process differently. I'm sure there is a way to have memory that 'belongs' to the DLL, but I don't know how.

A good example is the data for a window class - it takes more code to set up the local structure than to just have a global structure already initialized, and change the couple of items. For some people it is a matter of style - keeping the data defined where it is used. Which is very understandable from a maintenance perspective.
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 22 Dec 2007, 00:58
If anyone can find me a useful explanation of how to use lookup tables for multiplication in GF(2^8), it would be greatly appreciated. I have found several (you might find Samiam.org) and all they do is describe the general concept. I have been looking everywhereeeee and cannot find anyone who can explain how to calculate the tables, but I guess I can figure it out by myself in a few days. Please, if Revolution is there, tell me how you did it!!! Most of my previously "optimized" functions like SubWord above I have fit in only a dozen instructions or less. I would like to do the same for MixColumns and its inverse, but the only thing holding me back is not knowing which T table is which, or what the U tables are for, etc... Please tell me the secret Revolution!!
bitRAKE



Joined: 21 Jul 2003
Posts: 4060
Location: vpcmpistri
bitRAKE 22 Dec 2007, 01:35
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20416
Location: In your JS exploiting you and your system
revolution 22 Dec 2007, 02:26
AlexP wrote:
... and cannot find anyone who can explain how to calculate the tables, but I guess I can figure it out by myself in a few days.
That is a good idea, take a few days to study the GF thing. Once you understand it, it will be with you forever.
AlexP wrote:
Please, if Revolution is there, tell me how you did it!!! ... Please tell me the secret Revolution!!
My code is there exposed to the world. Nothing hidden, no secrets involved. Even the original documents for Rijndael give an explanation on precisely how GF works. Much better than I can explain here.
AlexP wrote:
... the only thing holding me back is not knowing which T table is which, or what the U tables are for, etc...
These tables and what they are for are also described nicely in the original Rijndael paper (ORP). The tables are only there for speed up and optimisation purposes. You can also leave them out, but the resulting code will be much slower. See the part about smart card implementations in the ORP.

There are 8xT tables and 4xU tables. If you are keen to save memory you can cut this down to 2xT and 1xU because the other tables are just rotations of the primary table. That is, (.t1) = rot8(.t2) = rot16(.t3) = rot24(.t4) and similarly for .t5-->.t8 and .u1-->.u4. Of course then you have to do the rotation in the code. It is a trade-off between speed and memory usage. Try it in different ways, no tables, 3 tables, all 12 tables, etc. and see which you are most comfortable with.
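The rotation relationship can be sketched in Python as follows. This is a minimal illustration and not the code from the DLL: the byte packing order (most significant byte first) and the rotation direction are assumptions, and the names `T0`..`T3`, `gf_mul` and `round_column` are made up for the example. It builds one primary encryption table, derives the other three by rotation, and combines four lookups into one output column.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """S-box: multiplicative inverse in GF(2^8), then the affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
        sbox.append(s)
    return sbox


def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


SBOX = build_sbox()
# Primary table: bytes (2*S, S, S, 3*S), packed most significant byte first.
T0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]
# The other three tables are byte rotations of the primary one.
T1 = [ror32(t, 8) for t in T0]
T2 = [ror32(t, 16) for t in T0]
T3 = [ror32(t, 24) for t in T0]


def round_column(a0, a1, a2, a3):
    """SubBytes + MixColumns for one column, given the bytes ShiftRows selects."""
    return T0[a0] ^ T1[a1] ^ T2[a2] ^ T3[a3]


assert T1[0] == ror32(T0[0], 8)
```

Keeping only `T0` and applying `ror32` at lookup time is the memory-saving variant mentioned above; keeping all four trades 3 KB of extra table space for fewer instructions per round.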

The S, SI, T and U tables contain a lot of the algorithm in precomputed form. I already mentioned the rotations, but also the log and alog are in there and a few other things.
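For the log and alog part, here is a small Python sketch of the usual construction, assuming the generator 3 and the Rijndael polynomial 0x11B. The names `LOG`, `ALOG` and `gf_mul_table` are illustrative, and this is not claimed to be how the tables in the DLL are generated.

```python
def gf_mul_slow(a, b):
    """Bit-by-bit multiplication in GF(2^8) modulo 0x11B (reference version)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_log_tables():
    """alog[i] = 3^i in GF(2^8); log is its inverse (log[0] is unused)."""
    alog = [0] * 255
    log = [0] * 256
    x = 1
    for i in range(255):
        alog[i] = x
        log[x] = i
        x = gf_mul_slow(x, 3)  # 3 generates all 255 non-zero elements
    return log, alog


LOG, ALOG = build_log_tables()


def gf_mul_table(a, b):
    """Multiply via tables: add the logarithms, then look up the antilog."""
    if a == 0 or b == 0:
        return 0
    return ALOG[(LOG[a] + LOG[b]) % 255]


# Worked example from FIPS-197: {57} * {83} = {c1}
assert gf_mul_table(0x57, 0x83) == 0xC1
```

With these two tables a multiplication costs two lookups, one addition and a reduction modulo 255, which is the idea behind the "log and alog" tables.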

BTW: I think my implementation of calculating the tables is unique, or at least if not unique, I can't find anywhere where someone has posted it. All of the other posted codes I have seen just use the tables given in the reference source code. You can also just use the precomputed tables in hundreds of boring lines of dd directives in the source. See the attachment. But it is more fun to create your own ;)
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 22 Dec 2007, 04:23
Hmmm.. I thought that all of those tables were for the multiplication lookup: when you call MixColumns and have to do the finite multiplying, I thought the tables were a collection of all 256 possibilities of 3,1,1,2, and 0e,0b,0d, and 09. I know the round constant, s-box, and inv s-box, but I cannot think of any finite field multiplying that has to be done outside of the MixColumns steps, or the two log tables to make the s-boxes. Maybe you need two for the normal s-box, two for the inv-sbox. That would make sense.... I'll read through some more stuff and try to figure it out, don't bother replying if you feel like giving up on me :) -Thanks for code

PS: Would you mind if I used your implementation? I tried making a couple tables myself and it didn't fit as small as yours. I already have my dll entry function set up (with switch block for dwreason) and I will either include the tables in the dll or make them on startup. I think it would be a lot better to make them on startup, so I can learn some new techniques for my upcoming projects. I plan on doing a few more ciphers other than Rijndael, so I'm trying to get a full understanding of each one before I finish it. By next week I should have Rijndael done, and will do SHA next. This is all for personal studies, so thanks for helping!

PPS: I think I finally understand it now, I re-read the Rijndael Proposal paper. Okay: The T(0-3) tables are just an optimization of the SubBytes,ShiftRows,and MixColumns all in one, and the T(4-7) tables are for the InvSubBytes,InvShiftRows,and InvMixColumns all in one, provided that I apply the InvMixColumns to the Round Key. Therefore, each round in the cipher can be implemented (for each column) as four registers, four table lookups T(0-3), and four xor operations. This will perform each round quickly, providing enough memory. The inverse cipher can be done this way as well with T(4-7),as long as I apply a different transformation to the key schedule before beginning the rounds. Then, the tables that I need are the S-Box, InvS-box, T(0-3), and T(4-7). I see that you used U(0-3) in your inverse cipher, but I cannot see why yet. For now, I understand what you've been trying to explain to me. As long as I either provide the tables in my code, or implement them when my main program loads my library, the encryption routine will run as fast as possible. The U tables must be of speed performance in the InvCipher, would you mind explaining it to me? While I was looking at your code, it looks like a sort of placement transformation between the T(4-7) and U(0-3). I will further examine it and probably make a PPPS notation underneath. NOTE: I see the linear offset of T(4-7) to U(0-3) is the byte obtained by the Inv substitution table * 4, and this is used as the offset in which the T-table byte is placed into the U-table. Is this only for speeding up your cipher?

Ohh I've got it (again :) ) The U(0-3) tables are used to speed up the calculation of decryption round keys. I'm kind of slow this morning, I'm just happy I just made the little connection in my brain. Well, I hope you did not read the whole paragraph I wrote. So may I use your optimized table-generating code? I may make two versions of the dll, one with the tables included and one with tables gen'd on startup. I guess that all I have to do now is code it all. Thanks for help, and that you have gone through this all before :/. I'll post it here when I get the first version of the code done, it will probably be hideous to good asm programmers lol.. Thanks again Revolution
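A hedged sketch of that idea in Python: a U table holds InvMixColumns applied to a single byte, so a round key word can be converted with four lookups and three xors. The layout below (most significant byte first, other tables obtained by rotating `U0`) is an assumption for illustration and may not match the tables in Revolution's code.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


# U0[x] packs (0e*x, 09*x, 0d*x, 0b*x), most significant byte first.
# Note there is no S-box in here, unlike the T tables.
U0 = [(gf_mul(x, 0x0E) << 24) | (gf_mul(x, 0x09) << 16)
      | (gf_mul(x, 0x0D) << 8) | gf_mul(x, 0x0B) for x in range(256)]


def inv_mix_word(word):
    """Apply InvMixColumns to one 32-bit column using table lookups."""
    a0 = (word >> 24) & 0xFF
    a1 = (word >> 16) & 0xFF
    a2 = (word >> 8) & 0xFF
    a3 = word & 0xFF
    return U0[a0] ^ ror32(U0[a1], 8) ^ ror32(U0[a2], 16) ^ ror32(U0[a3], 24)


# Commonly quoted MixColumns test column: db 13 53 45 -> 8e 4d a1 bc
assert inv_mix_word(0x8E4DA1BC) == 0xDB135345
```

Running `inv_mix_word` over the words of the middle round keys is what lets the inverse cipher use the same lookup-and-xor round structure as encryption.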

OKAY- ONLY QUESTION I NEED ANSWERED!
-I've been thinking of what mode of operation to choose, and I think I would settle on either CFB or OFB... They use only encryption to perform both operations, and with the Rijndael cipher this would be the best choice for speed optimizations. What do you think?
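As a rough Python illustration of why CFB and OFB only need the forward block operation: both modes encrypt a feedback value and xor the result with the data. `toy_encrypt_block` below is a hash-based stand-in that cannot even be inverted; it is NOT Rijndael and NOT secure, it only takes the place of the block encryption routine.

```python
import hashlib

BLOCK = 16  # block size in bytes


def toy_encrypt_block(key, block):
    """Stand-in for the block encryption routine (not a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a, b):
    """Xor two byte strings, truncating to the shorter one."""
    return bytes(x ^ y for x, y in zip(a, b))


def ofb(key, iv, data):
    """OFB: the same function encrypts and decrypts."""
    out = bytearray()
    feedback = iv
    for i in range(0, len(data), BLOCK):
        feedback = toy_encrypt_block(key, feedback)  # keystream feeds itself
        out += xor_bytes(data[i:i + BLOCK], feedback)
    return bytes(out)


def cfb_encrypt(key, iv, plaintext):
    """CFB (full-block feedback): the ciphertext is fed back."""
    out = bytearray()
    feedback = iv
    for i in range(0, len(plaintext), BLOCK):
        chunk = xor_bytes(plaintext[i:i + BLOCK], toy_encrypt_block(key, feedback))
        out += chunk
        feedback = chunk
    return bytes(out)


def cfb_decrypt(key, iv, ciphertext):
    """CFB decryption still calls the forward block function only."""
    out = bytearray()
    feedback = iv
    for i in range(0, len(ciphertext), BLOCK):
        chunk = ciphertext[i:i + BLOCK]
        out += xor_bytes(chunk, toy_encrypt_block(key, feedback))
        feedback = chunk
    return bytes(out)


key, iv = b"k" * 16, b"\x00" * 16
message = b"only the forward cipher is needed here"
assert ofb(key, iv, ofb(key, iv, message)) == message
assert cfb_decrypt(key, iv, cfb_encrypt(key, iv, message)) == message
```

With either mode the T(4-7) and U tables and the inverse key schedule would not be needed at all, which saves code and table space as well as time.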
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 24 Dec 2007, 01:42
Hey Revolution, I found a flaw in your code :). In your implementation of RCon, when viewed in memory, the first byte of the dword contains the value due to little-endianness. In the FIPS documentation, the lower byte of memory, or the upper byte of the registers, must contain the round constant value. The test I used was with 0x00000001, which is supposed to be 0x01000000. You should fix that in your code if you ever plan on using it; I'll put a little shl 24 or ror 8 in the code to fix it.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20416
Location: In your JS exploiting you and your system
revolution 24 Dec 2007, 01:44
My code correctly reproduces the test vectors. Perhaps you should check your code against the test vectors. If you don't do that then how can you know if my code works or not?
AlexP



Joined: 14 Nov 2007
Posts: 561
Location: Out the window. Yes, that one.
AlexP 24 Dec 2007, 01:52
I coded the key expansion routine prior to these posts, and it all worked fine. According to FIPS 197, the Round Constant table should have values with the most significant byte initialized. That is the way it was with my testing, and that is what it says. Any questions? :)
-Check with FIPS appendix A page 27.
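One possible way to reconcile the two positions, sketched in Python under the assumption that the difference is only in how key bytes are loaded into 32-bit words: FIPS 197 writes words with the first byte on the left, so Rcon is 0x01000000 when words are read big-endian, but the same byte sequence read as a little-endian dword is 0x00000001. Whether Revolution's code loads words this way has not been verified here. The byte values are the first expansion step from FIPS 197 Appendix A.1.

```python
def first_expanded_word(rcon, byteorder):
    """Compute w[4] of the FIPS-197 A.1 example with words loaded in byteorder."""
    w0_bytes = bytes.fromhex("2b7e1516")   # w[0], first four key bytes
    sub_bytes = bytes.fromhex("8a84eb01")  # SubWord(RotWord(w[3])) per A.1
    w0 = int.from_bytes(w0_bytes, byteorder)
    temp = int.from_bytes(sub_bytes, byteorder)
    w4 = w0 ^ temp ^ rcon
    return w4.to_bytes(4, byteorder)


# Big-endian words need Rcon in the top byte; little-endian words in the low byte.
big = first_expanded_word(0x01000000, "big")
little = first_expanded_word(0x00000001, "little")
assert big == little == bytes.fromhex("a0fafe17")
```

Either convention reproduces the test vector as long as the Rcon constant matches the byte order used for the word loads; mixing the two is what gives a wrong key schedule.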
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.