flat assembler
Message board for the users of flat assembler.
Need higher analysis of function
AlexP 16 Dec 2007, 16:52
My apologies for multiple threads, but I need some higher opinions of my code. I am attempting the SubWord() function of the Rijndael Cipher, and it was challenging to make in asm. There were other ways of doing it, but I chose to store the table in data and use an offset into this structure to do the substituting. I would love for anyone to look at this code, and please tell me what I can do to make it better, or any problems that you may see. The function is supposed to mask off all but the desired nibbles for the inner loop, which will track down and locate the appropriate offset into the structure. Here's the explanation if you would attempt to understand this code:
http://en.wikipedia.org/wiki/Rijndael_S-box
Thanks!
Code:
;===========================================
;==== Start SubWord()
;===========================================
;Substitute word function
;Takes a dword at a time, substitutes one byte at a time
;Loops, four bytes in eax are substituted -> eax
;preserves all regs except eax
;eax = row  ebx = column  ecx = i  edx = nibble mask counter
;edx is calculated by shl 4 (bits) (0x000000001,0x00000010,ect..)
SubWord:
;We are going to use ebx as column, ecx as i
;edx is used for the outer loop to mask off sequential nibbles
    push ebx
    push ecx
    push edx
    push ebp            ;for "another variable"
    push esi
    push edi
    ;xor ebx,ebx        ;ebx and edx are taken over anyway...
    xor ecx,ecx
    xor ebp,ebp
    xor esi,esi         ;esi is final holder, until end
    ;xor edx,edx
    mov edx,0x00000001  ;first outer iteration
    push eax            ;To preserve value until next loop, eax and ebx both masked
;Begin outer loop here!!
;=======================
BeginOuterLoop:
    mov eax,[esp+0x4]   ;original eax
    mov ebx,eax         ;get ebx ready to be masked
;eax contains the four bytes, ebx is a copy
;i is register ecx
;shift edx sequentially on every outer loop run to get
; appropriate nibble
;Mask off here
    and ebx,edx         ;we need this loop's column nibble
    shl edx,4
    and eax,edx         ;we need this loop's row nibble
    shl edx,4           ;Get it ready for next outer loop
;Odd masking arith, but it needs to be done
;Explanation:
; we needed the column nibble in ebx, row in eax
; so, because the column nibble is the furthest right,
; we had to mask it first, then shift the masking bit over
; to get the row nibble (most significant nibble, left)
FirstCompare:
    cmp eax,ecx         ;compare row against i
    je AfterFirstInnerLoop
    inc ecx
    jmp FirstCompare
AfterFirstInnerLoop:
;(i*16) + 1 -> ebp
    push eax
    push ebx
    push edx
    mov eax,ecx
    mov ebx,0x00000010  ;16
    mul ebx
    mov ebp,eax
    pop edx
    pop ebx
    pop eax
    inc ebp
    xor ecx,ecx         ;reset i -> 0, row has been found, now columnloop
SecondCompare:
    cmp ebx,ecx         ;compare column to i
    je AfterSecondInnerLoop
    inc ecx
    jmp SecondCompare
AfterSecondInnerLoop:
    add ebp,ecx         ;Add row offset (i) to "another variable"
;Now, do the byte change into eax
;For this, we will have a mem location for sbox/invsbox
    cmp [IsInv],0
    jne Inversed
    mov edi,[Sbox+ebp]  ;Store in temp reg
    or esi,edi          ;put replaced byte into final holder(esi)
    cmp edx,0x10000000  ;Check if this is last iteration
    jne BeginOuterLoop
    jmp EndSubWord
Inversed:
    mov edi,[InvSbox+ebp]
    or esi,edi          ;puts replaced byte into final holder(esi)
    cmp edx,0x10000000
    jne BeginOuterLoop
EndSubWord:
;Easy how one instruction does it all...
    mov eax,esi
;Put back regs
    add esp,0x4         ;Get rid of garbage eax from before outer loop
    pop edi
    pop esi
    pop ebp
    pop edx
    pop ecx
    pop ebx
    ret
;All registers except esp were used, all except eax restored
;===========================================
;=== End of SubWord()
;===========================================
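The row/column description above can be checked outside the assembler. This is a minimal sketch in Python (not fasm), assuming a hypothetical stand-in table instead of the real S-box. It splits each byte into a row nibble and a column nibble the way the post describes, and shows that row*16 + column is always the byte itself, so the table can be indexed directly by the byte.

```python
def sub_word(word, table):
    """Substitute each of the four bytes of a 32-bit word through a 256-entry table."""
    out = 0
    for shift in (0, 8, 16, 24):
        b = (word >> shift) & 0xFF
        row, col = b >> 4, b & 0x0F      # high nibble = row, low nibble = column
        index = row * 16 + col           # offset into a 16x16 table stored row by row
        assert index == b                # the offset is just the byte value
        out |= table[index] << shift
    return out

# Hypothetical stand-in table (NOT the real Rijndael S-box): maps b -> 255 - b.
demo_table = [255 - b for b in range(256)]
print(hex(sub_word(0x00FF1234, demo_table)))
```

If that holds, the search loops for row and column are not needed; a single zero-extended byte load can serve as the table offset.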
revolution 16 Dec 2007, 17:16
Here's a present for you: a DLL I made many years ago. No comments or documentation, so you'll have to work out how to use it on your own. I'm not even sure if it will still assemble on the latest fasm version. It is small because it generates all the tables on the fly at startup.
Code:
;************************************************************************************************
;constants
;************************************************************************************************
min_key = 4
max_key = 8
min_blocks = 4
max_blocks = 8
;do_odd_blocks =1
;do_odd_keys =1

format PE GUI 4.0 DLL
entry RD_init
include 'win32a.inc'

b equ byte
d equ dword

;************************************************************************************************
;constants
;************************************************************************************************
rounds_off = 6
max_rounds = max_key+rounds_off
if max_blocks > max_rounds
    max_rounds = max_blocks+rounds_off
end if
if defined do_odd_keys
    total_keys = max_key-min_key+1
else
    total_keys = (max_key-min_key)/2+1
end if
if defined do_odd_blocks
    total_blocks = max_blocks-min_blocks+1
else
    total_blocks = (max_blocks-min_blocks)/2+1
end if

;************************************************************************************************
;structures
;************************************************************************************************
struct RD_STATE
    working_b    rd max_blocks
    align 16
    rk           rd max_blocks*(max_rounds+1)
    align 16
    rk_d         rd max_blocks*(max_rounds+1)
    block_size   dd ?
    key_size     dd ?
    rounds       dd ?
    encrypt_func dd ?
    decrypt_func dd ?
    clear_size = $
    handle       dd ?
ends

struct RD_TABLES
    tab_s  rb 256
    tab_si rb 256
    t1 rd 256
    t2 rd 256
    t3 rd 256
    t4 rd 256
    t5 rd 256
    t6 rd 256
    t7 rd 256
    t8 rd 256
    u1 rd 256
    u2 rd 256
    u3 rd 256
    u4 rd 256
    rcon rd 30
ends

;************************************************************************************************
;global data
section '.udata' data readable writeable
;************************************************************************************************
align 16
RD_tables RD_TABLES

;************************************************************************************************
;macros
;************************************************************************************************
macro do_possible3a func,blk,rnd1,rnd2,rnd3,rnd4,rnd5 {
    do_next=0
    if min_blocks<=blk & max_blocks>=blk
        if (min_key<=4)
            func blk,4,rnd1
            if rnd1=0
                do_next=1
            end if
        end if
        if (min_key<=5 & max_key>=5 & defined do_odd_keys) | (do_next)
            func blk,5,rnd2
            do_next=0
            if rnd2=0
                do_next=1
            end if
        end if
        if (min_key<=6 & max_key>=6) | (do_next)
            func blk,6,rnd3
            do_next=0
            if rnd3=0
                do_next=1
            end if
        end if
        if (min_key<=7 & max_key>=7 & defined do_odd_keys) | (do_next)
            func blk,7,rnd4
            do_next=0
            if rnd4=0
                do_next=1
            end if
        end if
        if (max_key>=8) | (do_next)
            func blk,8,rnd5
        end if
    end if
}

macro do_possibles3 func {
    do_possible3a func,4,10,11,12,13,14
    if defined do_odd_blocks
        do_possible3a func,5,11,11,12,13,14
    end if
    do_possible3a func,6,12,12,12,13,14
    if defined do_odd_blocks
        do_possible3a func,7,13,13,13,13,14
    end if
    do_possible3a func,8,14,14,14,14,14
}

macro do_possibles2 func {
    do_possible3a func,4,10,11,12,13,14
    if defined do_odd_blocks
        do_possible3a func,5,00,11,12,13,14
    end if
    do_possible3a func,6,00,00,12,13,14
    if defined do_odd_blocks
        do_possible3a func,7,00,00,00,13,14
    end if
    do_possible3a func,8,00,00,00,00,14
}

macro define_shifts blks {
    sft1=1
    sft2=2
    sft3=3
    if blks > 6
        sft3=4
    end if
    if blks > 7
        sft2=3
    end if
}

;************************************************************************************************
section '.code' code readable executable
;************************************************************************************************
GF_modulus =01bh
GF_affine  =01fh
GF_magic   =063h

proc RD_init,hinstDLL,fdwReason,lpvReserved
    enter
    push ebx esi edi ebp
;************************************************************************************************
;the log and alog tables are used to make the S box. Afterwards they are overwritten by t2
;************************************************************************************************
alog_table =RD_tables.t2
log_table  =RD_tables.t2+256
;************************************************************************************************
;make alog and log tables
;************************************************************************************************
    mov eax,1               ;the first value is 1
    mov b[alog_table],al
    mov b[log_table],0      ;actually log(0) is not used
    mov ecx,eax
.alog:
    ;multiply AL by 3 in GF(2^8)
    mov bh,al
    add bh,bh
    sbb bl,bl
    and bl,GF_modulus
    xor bh,bl
    xor al,bh
    mov b[ecx+alog_table],al
    mov b[eax+log_table],cl
    add cl,1
    jnc .alog
    mov b[eax+log_table],cl ;set the last value mod 255
;************************************************************************************************
;make S box and inverse S box
;************************************************************************************************
    mov edx,255
    mov ecx,GF_magic
    ;the multiplicative inverse of 0 needs special handling (log and alog can't deal with 0)
    mov b[000+RD_tables.tab_s],cl
    mov b[ecx+RD_tables.tab_si],0
.sbox:
    movzx eax,b[edx+log_table]
    not al
    mov al,b[eax+alog_table] ;AL=multiplicative inverse of DL
    mov bl,GF_affine
    mov cl,GF_magic
.sbox2:
    shr al,1
    sbb bh,bh
    and bh,bl
    xor cl,bh
    rol bl,1
    test al,al
    jnz .sbox2
    mov [edx+RD_tables.tab_s],cl
    mov [ecx+RD_tables.tab_si],dl
    sub edx,1
    jnz .sbox
;************************************************************************************************
;make t5, t6, t7, t8, u1, u2, u3 & u4
;************************************************************************************************
    xor esi,esi
    xor edx,edx
    mov cl,b[log_table+0eh]
    mov ch,b[log_table+09h]
    mov bl,b[log_table+0dh]
    mov bh,b[log_table+0bh]
.ti:
    movzx eax,b[esi+RD_tables.tab_si]
    lea edi,[eax*4]
    test eax,eax
    movzx ebp,al
    jz .ti2
    mov al,b[eax+log_table]
    mov dl,al
    add dl,cl
    adc dl,0
    mov dl,b[edx+alog_table]
    shrd ebp,edx,8
    mov dl,al
    add dl,ch
    adc dl,0
    mov dl,b[edx+alog_table]
    shrd ebp,edx,8
    mov dl,al
    add dl,bl
    adc dl,0
    mov dl,b[edx+alog_table]
    shrd ebp,edx,8
    add al,bh
    adc al,0
    mov al,b[eax+alog_table]
    shrd ebp,eax,8
.ti2:
    mov [esi*4+RD_tables.t5],ebp
    mov [edi+RD_tables.u1],ebp
    rol ebp,8
    mov [esi*4+RD_tables.t6],ebp
    mov [edi+RD_tables.u2],ebp
    rol ebp,8
    mov [esi*4+RD_tables.t7],ebp
    mov [edi+RD_tables.u3],ebp
    rol ebp,8
    mov [esi*4+RD_tables.t8],ebp
    mov [edi+RD_tables.u4],ebp
    add esi,1
    test esi,0ffh
    jnz .ti
;************************************************************************************************
;make t1, t2, t3 & t4
;************************************************************************************************
    xor edx,edx
.t:
    movzx eax,b[edx+RD_tables.tab_s]
    ;the log and alog tables will be destroyed so the multiplications are done directly
    ;multiply AL by 2 in GF(2^8)
    mov ah,al
    add ah,ah
    sbb bl,bl
    and bl,GF_modulus
    xor ah,bl               ;ah=al*2
    ;multiply AL by 3 in GF(2^8)
    mov ch,al
    xor ch,ah               ;ch=al*3
    ;make the value
    mov cl,al
    shl ecx,16
    mov ch,al
    mov cl,ah
    mov [edx*4+RD_tables.t1],ecx
    rol ecx,8
    mov [edx*4+RD_tables.t2],ecx
    rol ecx,8
    mov [edx*4+RD_tables.t3],ecx
    rol ecx,8
    mov [edx*4+RD_tables.t4],ecx
    add dl,1
    jnc .t
;************************************************************************************************
;make rcon
;************************************************************************************************
    xor edx,edx
    mov eax,1
.rcon:
    mov [edx*4+RD_tables.rcon],eax
    ;multiply AL by 2 in GF(2^8)
    add al,al
    sbb bl,bl
    and bl,GF_modulus
    xor al,bl               ;al=al*2
    add dl,1
    cmp dl,30
    jb .rcon
    pop ebp edi esi ebx
    mov eax,TRUE
    return
endp

proc RD_clear_state,state
    enter
    push ebx edi
    mov eax,-1
    mov ebx,[state]
    mov edx,RD_STATE.clear_size
    shr edx,2
    mov edi,ebx
    mov ecx,edx
    rep stosd
    mov eax,0aaaaaaaah
    mov edi,ebx
    mov ecx,edx
    rep stosd
    not eax                 ;055555555h
    mov edi,ebx
    mov ecx,edx
    rep stosd
    xor eax,eax
    mov edi,ebx
    mov ecx,edx
    rep stosd
    pop edi ebx
    return
endp

proc RD_free_state,state
    enter
    stdcall RD_clear_state,[state]
    mov eax,[state]
    invoke VirtualUnlock,[eax+RD_STATE.handle],sizeof.RD_STATE+16
    test eax,eax
    jz .done
    mov eax,[state]
    invoke VirtualFree,[eax+RD_STATE.handle],sizeof.RD_STATE+16,MEM_DECOMMIT+MEM_RELEASE
.done:
    return
endp

proc RD_new_state,block_size,key_size
    enter
    xor eax,eax
    mov ecx,[block_size]
    sub ecx,min_blocks
    if total_blocks>1
        jb .done
        cmp ecx,max_blocks-min_blocks
        ja .done
        if ~defined do_odd_blocks
            shr ecx,1
            jc .done
        end if
    else
        jnz .done
    end if
    mov edx,[key_size]
    sub edx,min_key
    if total_keys>1
        jb .done
        cmp edx,max_key-min_key
        ja .done
        if ~defined do_odd_keys
            shr edx,1
            jc .done
        end if
    else
        jnz .done
    end if
    if total_keys=2
        lea edx,[edx+ecx*2]
    else if total_keys=3
        lea ecx,[ecx*2+ecx]
        add edx,ecx
    else if total_keys=4
        lea edx,[edx+ecx*4]
    else if total_keys=5
        lea ecx,[ecx*4+ecx]
        add edx,ecx
    end if
    push edx
    invoke VirtualAlloc,NULL,sizeof.RD_STATE+16,MEM_COMMIT,PAGE_READWRITE
    test eax,eax
    jz .done
    push eax
    ;lock the memory so it doesn't get put into the paging file
    invoke VirtualLock,eax,sizeof.RD_STATE+16
    test eax,eax
    pop eax
    pop edx
    jz .release
    lea ecx,[eax+0fh]
    and ecx,not 0fh
    mov [ecx+RD_STATE.handle],eax
    mov eax,[edx*8+RD_jump_table+0]
    mov edx,[edx*8+RD_jump_table+4]
    mov [ecx+RD_STATE.encrypt_func],eax
    mov [ecx+RD_STATE.decrypt_func],edx
    mov eax,[block_size]
    mov edx,[key_size]
    mov [ecx+RD_STATE.block_size],eax
    mov [ecx+RD_STATE.key_size],edx
    cmp eax,edx
    lea eax,[eax+rounds_off]
    ja .a
    lea eax,[edx+rounds_off]
.a:
    mov [ecx+RD_STATE.rounds],eax
    mov eax,ecx
.done:
    return
.release:
    invoke VirtualFree,eax,sizeof.RD_STATE+16,MEM_DECOMMIT+MEM_RELEASE
    xor eax,eax
    return
endp

proc RD_expand_key,state,input_key
    .loop_stop dd ?
    .loop_rcon dd ?
    enter
    push ebx esi edi ebp
    mov esi,[input_key]
    mov ebp,[state]
    lea edi,[ebp+RD_STATE.rk]
    mov ecx,[ebp+RD_STATE.key_size]
    rep movsd
    ;eax=scratch
    ;ebx=secondary compare for (loop counter mod key_size)
    ;ecx=loop counter
    ;edx=loop counter mod key_size
    ;esi=-key_size*4
    ;edi=offset to current rk
    ;ebp=STATE
    ;make the encryption round keys
    mov eax,[ebp+RD_STATE.rounds]
    add eax,1
    imul eax,[ebp+RD_STATE.block_size]
    sub eax,[ebp+RD_STATE.key_size]
    mov [.loop_stop],eax
    xor ecx,ecx
    xor ebx,ebx
    cmp [ebp+RD_STATE.key_size],6
    setge bl
    shl ebx,2
    xor edx,edx
    mov [.loop_rcon],RD_tables.rcon
    mov esi,[ebp+RD_STATE.key_size]
    shl esi,2
    neg esi
.a:
    mov eax,[edi-4]
    test edx,edx
    jnz .b
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    ror eax,8
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    ror eax,8
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    ror eax,8
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    ror eax,16              ;rotbyte done here
    mov edx,[.loop_rcon]
    xor eax,[edx]           ;rcon
    xor edx,edx             ;reset the counter
    jmp .c
.b:
    cmp edx,ebx
    jnz .c
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    rol eax,8
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    rol eax,8
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    rol eax,8
    movzx edx,al
    mov al,[edx+RD_tables.tab_s]
    rol eax,8
    mov edx,ebx             ;reset the counter
.c:
    xor eax,[edi+esi]
    mov [edi],eax
    add edi,4
    add edx,1
    cmp edx,[ebp+RD_STATE.key_size]
    jb .d
    xor edx,edx
    add [.loop_rcon],4
.d:
    add ecx,1
    cmp ecx,[.loop_stop]
    jb .a
    ;make the decryption round keys
    lea esi,[ebp+RD_STATE.rk]
    lea edi,[ebp+RD_STATE.rk_d]
    mov ecx,[ebp+RD_STATE.key_size]
    rep movsd
    mov ebp,[.loop_stop]
    sub edi,esi
.e:
    movzx eax,b[esi]
    movzx ebx,b[esi+1]
    movzx ecx,b[esi+2]
    movzx edx,b[esi+3]
    mov eax,[eax*4+RD_tables.u1]
    xor eax,[ebx*4+RD_tables.u2]
    xor eax,[ecx*4+RD_tables.u3]
    xor eax,[edx*4+RD_tables.u4]
    mov [esi+edi],eax
    add esi,4
    sub ebp,1
    jnz .e
    pop ebp edi esi ebx
    return
endp

macro copy_data blks,source,dest {
    repeat (blks+3)/4
        mov eax,[source+(%-1)*16]
        if (blks-(%-1)*4) >= 2
            mov ebx,[source+(%-1)*16+4]
        end if
        if (blks-(%-1)*4) >= 3
            mov ecx,[source+(%-1)*16+8]
        end if
        if (blks-(%-1)*4) >= 4
            mov edx,[source+(%-1)*16+12]
        end if
        mov [dest+(%-1)*16],eax
        if (blks-(%-1)*4) >= 2
            mov [dest+(%-1)*16+4],ebx
        end if
        if (blks-(%-1)*4) >= 3
            mov [dest+(%-1)*16+8],ecx
        end if
        if (blks-(%-1)*4) >= 4
            mov [dest+(%-1)*16+12],edx
        end if
    end repeat
}

macro keyaddition blks {
    ;esi=a
    ;edi=rk
    repeat (blks+3)/4
        mov eax,[edi+(%-1)*16]
        if (blks-(%-1)*4) >= 2
            mov ebx,[edi+(%-1)*16+4]
        end if
        if (blks-(%-1)*4) >= 3
            mov ecx,[edi+(%-1)*16+8]
        end if
        if (blks-(%-1)*4) >= 4
            mov edx,[edi+(%-1)*16+12]
        end if
        xor [esi+(%-1)*16],eax
        if (blks-(%-1)*4) >= 2
            xor [esi+(%-1)*16+4],ebx
        end if
        if (blks-(%-1)*4) >= 3
            xor [esi+(%-1)*16+8],ecx
        end if
        if (blks-(%-1)*4) >= 4
            xor [esi+(%-1)*16+12],edx
        end if
    end repeat
}

macro tablelookup_key_add blks,source,dest,key_off {
    repeat blks
        movzx eax,b[source+((0000+%-1) mod blks)*4+0]
        movzx ebx,b[source+((sft1+%-1) mod blks)*4+1]
        movzx ecx,b[source+((sft2+%-1) mod blks)*4+2]
        mov eax,[eax*4+RD_tables.t1]
        xor eax,[ebx*4+RD_tables.t2]
        movzx ebx,b[source+((sft3+%-1) mod blks)*4+3]
        xor eax,[ecx*4+RD_tables.t3]
        xor eax,[edi+(%-1)*4+key_off]
        xor eax,[ebx*4+RD_tables.t4]
        mov [dest+(%-1)*4],eax
    end repeat
}

macro tablelookupf_key_add blks,source,dest,key_off {
    repeat blks
        movzx eax,b[source+((sft2+%-1) mod blks)*4+2]
        movzx ebx,b[source+((sft3+%-1) mod blks)*4+3]
        movzx ecx,b[source+((0000+%-1) mod blks)*4+0]
        movzx edx,b[source+((sft1+%-1) mod blks)*4+1]
        mov al,b[eax+RD_tables.tab_s]
        mov ah,b[ebx+RD_tables.tab_s]
        shl eax,16
        mov al,b[ecx+RD_tables.tab_s]
        mov ah,b[edx+RD_tables.tab_s]
        xor eax,[edi+(%-1)*4+key_off]
        mov [dest+(%-1)*4],eax
    end repeat
}

macro encrypt_block blks,keys,rounds {
    align 16
    encrypt_block_#blks#_#keys#:
    if rounds>0
        define_shifts blks
        keyaddition blks
        mov edx,(rounds-1)
        add edi,blks*4
    .a:
        tablelookup_key_add blks,esi,ebp,0
        xchg esi,ebp
        add edi,blks*4
        sub edx,1
        jnz .a
        tablelookupf_key_add blks,esi,ebp,0
        if (rounds and 1)
            copy_data blks,ebp,esi
        end if
        ret
    end if
}

do_possibles2 encrypt_block

macro tablelookup2_key_add blks,source,dest,key_off {
    repeat blks
        movzx eax,b[source+((000000000+%-1) mod blks)*4+0]
        movzx ebx,b[source+((blks-sft1+%-1) mod blks)*4+1]
        movzx ecx,b[source+((blks-sft2+%-1) mod blks)*4+2]
        mov eax,[eax*4+RD_tables.t5]
        xor eax,[ebx*4+RD_tables.t6]
        movzx ebx,b[source+((blks-sft3+%-1) mod blks)*4+3]
        xor eax,[ecx*4+RD_tables.t7]
        xor eax,[edi+(%-1)*4-key_off]
        xor eax,[ebx*4+RD_tables.t8]
        mov [dest+(%-1)*4],eax
    end repeat
}

macro tablelookup2f_key_add blks,source,dest,key_off {
    repeat blks
        movzx eax,b[source+((blks-sft2+%-1) mod blks)*4+2]
        movzx ebx,b[source+((blks-sft3+%-1) mod blks)*4+3]
        movzx ecx,b[source+((000000000+%-1) mod blks)*4+0]
        movzx edx,b[source+((blks-sft1+%-1) mod blks)*4+1]
        mov al,[eax+RD_tables.tab_si]
        mov ah,[ebx+RD_tables.tab_si]
        shl eax,16
        mov al,[ecx+RD_tables.tab_si]
        mov ah,[edx+RD_tables.tab_si]
        xor eax,[edi+(%-1)*4-key_off]
        mov [dest+(%-1)*4],eax
    end repeat
}

macro decrypt_block blks,keys,rounds {
    align 16
    decrypt_block_#blks#_#keys#:
    if rounds>0
        define_shifts blks
        add edi,rounds*blks*4
        keyaddition blks
        mov edx,(rounds-1)
        sub edi,blks*4
    .a:
        tablelookup2_key_add blks,esi,ebp,0
        xchg esi,ebp
        sub edi,blks*4
        sub edx,1
        jnz .a
        tablelookup2f_key_add blks,esi,ebp,0
        if (rounds and 1)
            copy_data blks,ebp,esi
        end if
        ret
    end if
}

do_possibles2 decrypt_block

macro jump_table blks,keys,rounds {
    dd encrypt_block_#blks#_#keys,decrypt_block_#blks#_#keys
}

align 8
RD_jump_table:
do_possibles3 jump_table

proc RD_encrypt_block,state,block_pointer
    enter
    push ebx esi edi ebp
    mov eax,[state]
    mov esi,[block_pointer]
    lea edi,[eax+RD_STATE.rk]
    lea ebp,[eax+RD_STATE.working_b]
    call [eax+RD_STATE.encrypt_func]
    pop ebp edi esi ebx
    return
endp

proc RD_encrypt_blocks,state,block_pointer,blocks
    enter
    push ebx esi edi
    mov eax,[state]
    mov esi,[block_pointer]
.a:
    lea edi,[eax+RD_STATE.rk]
    lea ebp,[eax+RD_STATE.working_b]
    push ebp
    call [eax+RD_STATE.encrypt_func]
    pop ebp
    mov eax,[state]
    mov esi,[block_pointer]
    mov ecx,[blocks]
    mov edx,[eax+RD_STATE.block_size]
    lea esi,[esi+edx*4]
    mov [block_pointer],esi
    sub ecx,1
    mov [blocks],ecx
    jnz .a
    pop edi esi ebx
    return
endp

proc RD_decrypt_block,state,block_pointer
    enter
    push ebx esi edi ebp
    mov eax,[state]
    mov esi,[block_pointer]
    lea edi,[eax+RD_STATE.rk_d]
    lea ebp,[eax+RD_STATE.working_b]
    call [eax+RD_STATE.decrypt_func]
    pop ebp edi esi ebx
    return
endp

proc RD_decrypt_blocks,state,block_pointer,blocks
    enter
    push ebx esi edi
    mov eax,[state]
    mov esi,[block_pointer]
.a:
    lea edi,[eax+RD_STATE.rk]
    lea ebp,[eax+RD_STATE.working_b]
    push ebp
    call [eax+RD_STATE.decrypt_func]
    pop ebp
    mov eax,[state]
    mov esi,[block_pointer]
    mov ecx,[blocks]
    mov edx,[eax+RD_STATE.block_size]
    lea esi,[esi+edx*4]
    mov [block_pointer],esi
    sub ecx,1
    mov [blocks],ecx
    jnz .a
    pop edi esi ebx
    return
endp

section '.idata' import data readable writeable
library kernel,'KERNEL32.DLL'
import kernel,\
    VirtualAlloc,'VirtualAlloc',\
    VirtualFree,'VirtualFree',\
    VirtualLock,'VirtualLock',\
    VirtualUnlock,'VirtualUnlock'

section '.edata' export data readable
export 'RIJNDAEL.DLL',\
    RD_decrypt_block,'RijndaelDecryptBlock',\
    RD_decrypt_blocks,'RijndaelDecryptBlocks',\
    RD_encrypt_block,'RijndaelEncryptBlock',\
    RD_encrypt_blocks,'RijndaelEncryptBlocks',\
    RD_expand_key,'RijndaelExpandKey',\
    RD_free_state,'RijndaelFree',\
    RD_new_state,'RijndaelOpen',\
    RD_clear_state,'RijndaelWash'

section '.reloc' fixups data discardable

PS. You may need to alter the structure defs, because I used a custom structure macro back then when I wrote it.
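For readers who want to see the "tables on the fly" idea without tracing registers, here is a rough Python sketch of what the S-box part of the startup code appears to compute. It is a reading of the listing, not the author's own description: powers of 3 give the alog/log tables, the multiplicative inverse comes from 255 minus the log, and the affine step uses the constant 0x63 (GF_magic in the listing). The function name build_sboxes is made up for this sketch.

```python
def build_sboxes():
    """Return (sbox, inv_sbox) built from log/alog tables, with no stored constants."""
    # alog[i] = 3**i in GF(2^8); log is the inverse mapping. 3 generates all 255 non-zero values.
    alog, log = [0] * 256, [0] * 256
    v = 1
    for i in range(255):
        alog[i], log[v] = v, i
        v2 = (v << 1) ^ (0x11B if v & 0x80 else 0)   # v * 2, reduced by the AES polynomial
        v ^= v2                                       # v * 3 = v * 2 xor v
    sbox, inv_sbox = [0] * 256, [0] * 256
    for x in range(256):
        # 0 has no inverse; it is mapped to 0 by convention
        inv = 0 if x == 0 else alog[(255 - log[x]) % 255]
        s = 0x63
        for r in range(5):                            # inv xor its rotations by 1..4 bits
            s ^= ((inv << r) | (inv >> (8 - r))) & 0xFF
        sbox[x] = s
        inv_sbox[s] = x
    return sbox, inv_sbox

sbox, inv_sbox = build_sboxes()
print(hex(sbox[0x53]))
```

The first few outputs can be compared against the table on the Wikipedia page linked earlier in the thread.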
AlexP 16 Dec 2007, 18:13
Thanks for checking it out. I will take a look at that code. I just located another asm implementation, and it is totally linear encrypt/decrypt in 16-bit mode. Very small, but I saw many odd instructions which turned out to be a lot of stuff in other files, more C code, etc., so I didn't look any further. I was thinking there was another way to do the multiplication you fixed above, thanks.
Okay, here's my next "challenge". Please criticize and ridicule to no end so I can make it more optimized! I have tried it, and it is working for now. It uses a substitution table (SBox) and, based on nibbles given byte by byte in eax, substitutes them from the table. Please refer to my earlier URL to see the example. It is the simplest thing to do in real life, but a pain in asm. The point is to use the first nibble for the row of the table, the second nibble for the column, and replace the byte with the byte of the table in that location. Have fun!!
Code:
;===========================================
;==== Start SubWord()
;============================================
;Substitute word function
;Takes a dword at a time, substitutes one byte at a time
;Loops, four bytes in eax are substituted -> eax
;preserves all regs except eax
;IsInv data byte is set before function, determines which table to use
;eax = current row nibble
;ebx = current column nibble
;ecx (cl) = mask counter
;edx = nibble mask
;ebp = original eax passed to function
SubWord:
    push ecx
    push edx
    push edi
    push esi
    push ebx
    push ebp
    mov ebp,eax         ;Used on every iteration
    mov edx,0x0000000F  ;Initial mask setting
    xor ecx,ecx
    xor esi,esi
BeginOuterLoop:
    mov eax,ebp         ;Original eax
    mov ebx,ebp         ;Original eax
    and ebx,edx         ;ebx -> column mask
    shl edx,4           ;Increment masker 1 position
    shr ebx,cl          ;Obtain value from mask -> ebx
    and eax,edx         ;eax -> row mask
    shl edx,4           ;Increment masker 1 position
    add cl,4
    shr eax,cl          ;Obtain value from mask -> eax
    add cl,4
;We need to multiply row pointer by 16 to obtain row offset
;(because there's 16 bytes in every row)
    lea eax,[eax*8]
    lea eax,[eax*2]
;Now calculate the substitution
    add eax,ebx         ;Add the row offset to the column offset
    cmp [IsInv],0
    jne Inversed
    mov edi,DWORD [SBox + eax]
    jmp AfterInv
Inversed:
    mov edi,DWORD [InvSBox + eax]
AfterInv:
    and edi,0x000000FF  ;Mask off upper bits
    cmp cl,8
    je First
    sub cl,8
    shl edi,cl
    add cl,8
First:
    or esi,edi          ;Place replaced bits in esi
    cmp edx,0
    jne BeginOuterLoop
;Finishing sequence
    mov eax,esi
    pop ebp
    pop ebx
    pop esi
    pop edi
    pop edx
    pop ecx
    ret
;===========================================
;=== End of SubWord()
;===========================================
WOW!!! I just finished a full day of coding (I'm only an early teen, don't make fun of me) and spent about 4 hours debugging and optimising this code. Not the one above: I finally got a Rijndael key expansion (128, 192, and 256-bit) program going. I ran it through multiple "testing" keys and all the results matched!!! I feel so good that I get to play with it tomorrow and learn how to use scanf with hex characters.... shall be fun!! Thanks to anyone who bothered to help me!!!!
AlexP 20 Dec 2007, 02:19
Is anyone still there? I was just wondering if the % modulus symbol in fasm is made into a lot of code, I see it in the dll above
revolution 20 Dec 2007, 04:01
% gives the current repeat count, mod gives a modulus. This is not C.
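To make that concrete, here is a small Python sketch that mimics what the assembler would evaluate for the source offsets in tablelookup_key_add, under the assumption that % runs from 1 to blks inside `repeat blks` and that `mod` is computed at assembly time. The helper names are invented for the sketch; the shift values follow define_shifts in the listing.

```python
def define_shifts(blks):
    """Mirror of the define_shifts macro: row shift amounts for a given block width."""
    sft1, sft2, sft3 = 1, 2, 3
    if blks > 6:
        sft3 = 4
    if blks > 7:
        sft2 = 3
    return sft1, sft2, sft3

def source_offsets(blks):
    """Constant byte offsets read on each pass of `repeat blks`."""
    shifts = (0,) + define_shifts(blks)
    rows = []
    for pct in range(1, blks + 1):        # fasm's % counts from 1
        rows.append(tuple(((s + pct - 1) % blks) * 4 + i
                          for i, s in enumerate(shifts)))
    return rows

for row in source_offsets(4):
    print(row)
```

Each tuple is a set of plain constants, so under these assumptions the generated code contains fixed displacements and no division instruction.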
AlexP 20 Dec 2007, 22:21
Ohh yea, I forgot about that in the FASM manual. I've been poring over asm and C implementations of Rijndael and it is driving me nuts finding an optimized MixColumns or InvMixColumns nonetheless. So, I'll quickly check if the "mod" actually expands to a div instruction in the code. Thanks-
PS: if anyone can find a MixColumns in asm, please show me the site. The Rijndael dll above is nice, and probably works, but I am doing it just for educational purposes and want to keep it in "function form"... aka not inlined.
NEW QUESTION: Does the entry point in a DLL get executed whenever a function in the dll is called? How does that work? Would this work?
Code:
format PE GUI 4.0 DLL
entry Init
;Later
Init:
    push ebp
    mov ebp,esp
    pusha
    ret
;All my functions already have above code done for them?
revolution 21 Dec 2007, 00:44
The code I posted above does all the mix columns and other things.
Code:
movzx eax,b[source+((sft2+%-1) mod blks)*4+2]
Also referring to my code above, the DLL entry is called whenever it is mapped into, or out of, an address space. The reason for the call is passed in fdwReason. Well-behaved code should check this parameter and act accordingly. Notice that my code is not well behaved because I completely ignore the value in fdwReason and just continue to compute the tables.
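As a toy model only (Python, not Windows API code), the behaviour described above might be pictured like this: the loader calls the entry once per event and passes a reason code, and well-behaved code does its one-time setup only for the process-attach reason. The four reason values are the standard Windows constants; dll_entry and tables_built are names made up for the sketch.

```python
# Standard Windows reason codes passed to a DLL entry point
DLL_PROCESS_DETACH, DLL_PROCESS_ATTACH = 0, 1
DLL_THREAD_ATTACH, DLL_THREAD_DETACH = 2, 3

tables_built = 0   # counts how often the (stand-in) table setup runs

def dll_entry(reason):
    """Toy model of a DLL entry point; returns TRUE (1) as RD_init does."""
    global tables_built
    if reason == DLL_PROCESS_ATTACH:
        tables_built += 1          # one-time setup belongs here
    return 1

# The loader, not the caller of an exported function, delivers these events.
for reason in (DLL_PROCESS_ATTACH, DLL_THREAD_ATTACH,
               DLL_THREAD_DETACH, DLL_PROCESS_DETACH):
    dll_entry(reason)
print(tables_built)
```

Calling an exported function never goes through the entry point in this model, which is why per-function prologue code placed there would have no effect.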
AlexP 21 Dec 2007, 01:26
Okay... So the entry is called once every time the dll is loaded? So the dll executes this code on startup, and by the time the first function is called from the dll, that code has already run? -Getting confused here-... I will look at those pieces of code. My implementation of MixColumns takes up dozens of lines of code, and is very confusing. It appears that you use some lookup tables called T1, T2, etc. What are those tables used for? I have read about log and alog tables for multiplication, but what are those? Could you describe that code's functionality a little more? It looks like the section containing that code does a 'repeat blks' instruction and works on it a dword at a time. I use a very long, complex xoring operation that gives the results the documentation calls for. I still have to debug the entire cipher/decipher, but it is put together. I'll have a lot to do, but thankfully we are off school all next week. I'll get it put together and try using some of your techniques if I can understand them. I'll post more questions about it here when I have them. Thanks again
bitRAKE 21 Dec 2007, 05:03
AlexP wrote: Would this work?
I don't know much about DLLs, but what I've read seems to indicate that the entry is called only after the DLL is mapped into a process's address space. The process that uses the DLL will call the functions within the DLL directly - Init will have taken place prior to any possible use of DLL functions by that process. So, if I understand correctly, the code you suggest will not be required. The DLL Init is used to establish the DLL within the process - the kind of global one-time things that the other functions might need. Not the local function-specific stuff your code suggests.
Last edited by bitRAKE on 21 Dec 2007, 06:03; edited 1 time in total
LocoDelAssembly 21 Dec 2007, 05:18
Quote:
Actually GPR registers; it is pusha, not pushf. I don't understand how this could be intended behavior either.
AlexP 21 Dec 2007, 22:48
Sorry, I've got it now. I was just confused because I didn't notice that was only about 3 lines of his, not several dozen. I've decided on having 4 sections of code: .Keys, .Tables, .Rijndael, and .Exports. Everything has been figured out as to how the entry point in a dll is called, thanks for the help. I'm still looking for a fast algorithm for calculating the tables at runtime; I may do that or just put the tables in the .Tables section. I noticed that Revolution claimed his dll was smaller, but when I looked he reserved the same amount of memory as he would have if the tables were already there. This led me to another kind of dumb question: if you have RD 256, the program reserves that much at runtime, compared to having a 256-dword data table, right? I've never had to create libraries much before, so I believe it is a stupid question. I guess that the actual dll file is smaller when reserving data and then calculating at runtime, compared to having all the kilobytes of tables packed into the dll file. I will most likely start off with just having the tables hard-coded and then later calculate them when I can get the routines working. Any suggestions or comments?
PS: I used a push ebp, etc. and then pusha because I usually do that for debug. I just used ebp to get the parameters, then pushed all the GPRs just to make debug first. Good analysis though
bitRAKE 21 Dec 2007, 23:06
Generating values at runtime costs additional space for code - sometimes this is worse than just storing the initial data. If coding for a small memory footprint then both code and data size must be considered. Also, it's possible to save additional memory with read-only data in a DLL because when multiple processes are using the DLL the data only has to be mapped into each process - whereas changing memory resides in each process differently. I'm sure there is a way to have memory that 'belongs' to the DLL, but I don't know how.
A good example is the data for a window class - it takes more code to set up the local structure than to just have a global structure already initialized, and change a couple of items. For some people it is a matter of style - keeping the data defined where it is used. Which is very understandable from a maintenance perspective.
AlexP 22 Dec 2007, 00:58
If anyone can find me a useful explanation of how to use lookup tables for multiplication in GF(2^8), it would be greatly appreciated. I have found several (you might find Samiam.org) and all they do is give the general concept. I have been looking everywhere and cannot find anyone who can explain how to calculate the tables, but I guess I can figure it out by myself in a few days. Please, if Revolution is there, tell me how you did it!!! Most of my previously "optimized" functions like SubWord above I have fit in only a dozen instructions or less. I would like to do the same for MixColumns and its inverse, but the only thing holding me back is not knowing which T table is which, or what the U tables are for, etc. Please tell me the secret, Revolution!!
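On the general question of table-driven multiplication in GF(2^8), the usual idea is that every non-zero element is a power of 3, so a product becomes an addition of logs. The following Python sketch is a generic illustration under that assumption, not code taken from the DLL; the alog table is doubled in length here purely to avoid a modulo on lookup, which is a choice made for the sketch.

```python
def make_tables():
    """Build log and anti-log tables for GF(2^8) using 3 as the generator."""
    alog, log = [0] * 510, [0] * 256
    v = 1
    for i in range(255):
        alog[i] = alog[i + 255] = v      # doubled so log sums up to 508 index directly
        log[v] = i
        v ^= (v << 1) ^ (0x11B if v & 0x80 else 0)   # v = v * 3
    return log, alog

LOG, ALOG = make_tables()

def gf_mul(a, b):
    """Multiply two field elements with two log lookups and one alog lookup."""
    if a == 0 or b == 0:
        return 0                          # log(0) does not exist
    return ALOG[LOG[a] + LOG[b]]

print(hex(gf_mul(0x57, 0x83)))
```

With fixed multipliers such as 2 and 3 (MixColumns) or 0e, 0b, 0d and 09 (InvMixColumns), one operand's log is a constant, which is what makes precomputing whole tables attractive.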
bitRAKE 22 Dec 2007, 01:35
revolution 22 Dec 2007, 02:26
AlexP wrote: ... and cannot find anyone who can explain how to calculate the tables, but I guess I can figure it out by myself in a few days.
AlexP wrote: Please, if Revolution is there, tell me how you did it!!! ... Please tell me the secret Revolution!!
AlexP wrote: ... the only thing holding me back is not knowing which T table is which, or what the U tables are for, ect...
There are 8xT tables and 4xU tables. If you are keen to save memory you can cut this down to 2xT and 1xU because the other tables are just rotations of the primary table. That is, (.t1) = rot8(.t2) = rot16(.t3) = rot24(.t4) and similarly for .t5-->.t8 and .u1-->.u4. Of course then you have to do the rotation in the code. It is a trade-off between speed and memory usage. Try it in different ways, no tables, 3 tables, all 12 tables, etc. and see which you are most comfortable with.
The S, SI, T and U tables contain a lot of the algorithm in precomputed form. I already mentioned the rotations, but also the log and alog are in there and a few other things.
BTW: I think my implementation of calculating the tables is unique, or at least if not unique, I can't find anywhere where someone has posted it. All of the other posted codes I have seen just use the tables given in the reference source code.
You can also just use the precomputed tables in hundreds of boring lines of dd directives in the source. See the attachment. But it is more fun to create your own.
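The rotation relationship can be tried out in a few lines. This Python sketch builds one T table from the S-box, derives the other three by rotating it, and checks that four lookups XORed together match SubBytes followed by MixColumns on one column. The byte layout (2s in the low byte, 3s in the high byte) follows my reading of the ".t" loop in the listing and should be treated as an assumption; the S-box is found by brute force here only to keep the sketch short.

```python
def xtime(a):
    return ((a << 1) ^ (0x11B if a & 0x80 else 0)) & 0xFF

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def make_sbox():
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = 0x63
        for r in range(5):
            s ^= ((inv << r) | (inv >> (8 - r))) & 0xFF
        sbox.append(s)
    return sbox

def rol32(w, n):
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF

SBOX = make_sbox()
# One primary table; each entry packs the bytes 2s, s, s, 3s (low byte first).
T1 = [(gf_mul(s, 3) << 24) | (s << 16) | (s << 8) | gf_mul(s, 2) for s in SBOX]
T2 = [rol32(w, 8) for w in T1]
T3 = [rol32(w, 16) for w in T1]
T4 = [rol32(w, 24) for w in T1]

def round_column_tables(b0, b1, b2, b3):
    """SubBytes + MixColumns for one column via four lookups."""
    return T1[b0] ^ T2[b1] ^ T3[b2] ^ T4[b3]

def round_column_direct(b0, b1, b2, b3):
    """Same result computed step by step, for comparison."""
    a = [SBOX[b] for b in (b0, b1, b2, b3)]
    out = [gf_mul(a[i], 2) ^ gf_mul(a[(i + 1) % 4], 3) ^ a[(i + 2) % 4] ^ a[(i + 3) % 4]
           for i in range(4)]
    return out[0] | (out[1] << 8) | (out[2] << 16) | (out[3] << 24)

print(hex(T1[0]))
```

Dropping T2 to T4 and calling rol32 at lookup time is the memory-saving variant mentioned above; it trades three tables for three rotates per column.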
AlexP 22 Dec 2007, 04:23
Hmmm.. I thought that all of those tables were for the multiplication lookup: when you call MixColumns and have to do the finite field multiplying, I thought the tables were a collection of all 256 possible products for 3,1,1,2 and 0e,0b,0d,09. I know the round constants, the S-box and the inverse S-box, but I cannot think of any finite field multiplying that has to be done outside of the MixColumns steps, or the two log tables to make the S-boxes. Maybe you need two for the normal S-box and two for the inverse S-box. That would make sense.... I'll read through some more stuff and try to figure it out, don't bother replying if you feel like giving up on me. -Thanks for code
PS: Would you mind if I used your implementation? I tried making a couple of tables myself and they didn't come out as small as yours. I already have my dll entry function set up (with a switch block for dwreason) and I will either include the tables in the dll or make them on startup. I think it would be a lot better to make them on startup, so I can learn some new techniques for my upcoming projects. I plan on doing a few more ciphers other than Rijndael, so I'm trying to get a full understanding of each one before I finish it. By next week I should have Rijndael done, and will do SHA next. This is all for personal studies, so thanks for helping!

PPS: I think I finally understand it now, I re-read the Rijndael Proposal paper. Okay: the T(0-3) tables are just an optimization of SubBytes, ShiftRows and MixColumns all in one, and the T(4-7) tables are for InvSubBytes, InvShiftRows and InvMixColumns all in one, provided that I apply InvMixColumns to the Round Key. Therefore, each round in the cipher can be implemented (for each column) as four registers, four table lookups T(0-3), and four xor operations. This will perform each round quickly, provided there is enough memory. The inverse cipher can be done this way as well with T(4-7), as long as I apply a different transformation to the key schedule before beginning the rounds. Then, the tables that I need are the S-box, InvS-box, T(0-3), and T(4-7).

I see that you used U(0-3) in your inverse cipher, but I cannot see why yet. For now, I understand what you've been trying to explain to me. As long as I either provide the tables in my code, or generate them when my main program loads my library, the encryption routine will run as fast as possible. The U tables must be there for speed in the InvCipher, would you mind explaining them to me? While I was looking at your code, it looked like a sort of placement transformation between the T(4-7) and U(0-3). I will examine it further and probably make a PPPS notation underneath.
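The "four lookups and four xors per column" idea can be sketched as follows. This is C, not fasm, and only an illustration. It assumes the big-endian column packing used by the public reference code (first state byte in the top byte of the dword), and it keeps a single table `T0` and rotates it to stand in for T1 to T3. The names `build_tables`, `sub_byte` and `round_column` are invented for this sketch.

```c
#include <stdint.h>

static uint8_t  sbox[256];  /* forward S-box, filled by build_tables()        */
static uint32_t T0[256];    /* T0[x] = (2*S[x], S[x], S[x], 3*S[x]), top first */

/* Shift-and-add multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* Rotate right; only called with n = 8, 16 or 24. */
static uint32_t rotr(uint32_t w, unsigned n) { return (w >> n) | (w << (32 - n)); }

void build_tables(void)
{
    for (int x = 0; x < 256; x++) {
        /* Multiplicative inverse by brute force; 0 maps to 0. */
        uint8_t inv = 0;
        for (int y = 1; y < 256 && x != 0; y++)
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        /* Affine step: inv xor its left rotations by 1..4, xor 0x63. */
        uint8_t s = inv;
        for (int k = 1; k <= 4; k++)
            s ^= (uint8_t)((inv << k) | (inv >> (8 - k)));
        s ^= 0x63;
        sbox[x] = s;
        T0[x] = ((uint32_t)gf_mul(s, 2) << 24) | ((uint32_t)s << 16)
              | ((uint32_t)s << 8) | (uint32_t)gf_mul(s, 3);
    }
}

/* Plain S-box lookup, as needed by the final round (no MixColumns there). */
uint8_t sub_byte(uint8_t x) { return sbox[x]; }

/* One output column of a normal round: SubBytes, ShiftRows, MixColumns and
   AddRoundKey. c0..c3 are the input columns starting at the output column's
   own index and wrapping around; rk is the matching round key word. */
uint32_t round_column(uint32_t c0, uint32_t c1, uint32_t c2, uint32_t c3, uint32_t rk)
{
    return T0[c0 >> 24]
         ^ rotr(T0[(c1 >> 16) & 0xff], 8)    /* plays the role of T1 */
         ^ rotr(T0[(c2 >> 8) & 0xff], 16)    /* plays the role of T2 */
         ^ rotr(T0[c3 & 0xff], 24)           /* plays the role of T3 */
         ^ rk;
}
```

With the round 1 state and round key from the FIPS 197 AES-128 example, `round_column(0x193de3be, 0xa0f4e22b, 0x9ac68d2a, 0xe9f84808, 0xa0fafe17)` gives 0xa49c7ff2, the first column at the start of round 2. Storing all four rotated tables removes the three rotations per column at the cost of another 3 KB, which is the speed/memory trade-off described earlier in the thread.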
NOTE: I see the linear offset of T(4-7) to U(0-3) is the byte obtained by the Inv substitution table * 4, and this is used as the offset at which the T-table byte is placed into the U-table. Is this only for speeding up your cipher?

Ohh I've got it (again): the U(0-3) tables are used to speed up the calculation of the decryption round keys. I'm kind of slow this morning, I'm just happy I made the little connection in my brain. Well, I hope you did not read the whole paragraph I wrote. So may I use your optimized table-generating code? I may make two versions of the dll, one with the tables included and one with the tables gen'd on startup. I guess that all I have to do now is code it all. Thanks for the help, and for having gone through this all before :/. I'll post it here when I get the first version of the code done, it will probably be hideous to good asm programmers lol.. Thanks again Revolution

OKAY- ONLY QUESTION I NEED ANSWERED! -I've been thinking of what mode of operation to choose, and I think I would settle on either CFB or OFB... They use only encryption to perform both operations, and with the Rijndael cipher this would be the best choice for speed optimizations. What do you think?
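The decryption round key step can be sketched like this. It is C, not fasm, and whether the U tables in the attachment are laid out exactly this way is an assumption. Here `U0[x]` is taken to be InvMixColumns applied to the column (x, 0, 0, 0), packed top byte first, and the other three U tables are replaced by rotations. `build_u_table` and `inv_mix_word` are invented names.

```c
#include <stdint.h>

static uint32_t U0[256];  /* U0[x] = (0e*x, 09*x, 0d*x, 0b*x), top byte first */

/* Shift-and-add multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* Rotate right; only called with n = 8, 16 or 24. */
static uint32_t rotr(uint32_t w, unsigned n) { return (w >> n) | (w << (32 - n)); }

void build_u_table(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t b = (uint8_t)x;
        U0[x] = ((uint32_t)gf_mul(b, 0x0e) << 24) | ((uint32_t)gf_mul(b, 0x09) << 16)
              | ((uint32_t)gf_mul(b, 0x0d) << 8)  |  (uint32_t)gf_mul(b, 0x0b);
    }
}

/* InvMixColumns on one key word: four lookups and three xors. */
uint32_t inv_mix_word(uint32_t w)
{
    return U0[w >> 24]
         ^ rotr(U0[(w >> 16) & 0xff], 8)
         ^ rotr(U0[(w >> 8) & 0xff], 16)
         ^ rotr(U0[w & 0xff], 24);
}
```

As a check, `inv_mix_word(0x046681e5)` gives 0xd4bf5d30, which undoes the MixColumns step on the first column of round 1 in the FIPS 197 example. For the equivalent inverse cipher in FIPS 197 this transform is applied to every word of the round keys except the first and last round key, once, when the key is set up.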
AlexP 24 Dec 2007, 01:42
Hey Revolution, I found a flaw in your code. In your implementation of RCon, when viewed in memory, the first byte of the dword contains the value because of little-endianness. In the FIPS documentation, the lowest byte in memory, or the upper byte of the register, must contain the round constant value. The value I tested was 0x00000001, which is supposed to be 0x01000000. You should fix that in your code if you ever plan on using it; I'll put a little shl 24 or ror 8 in the code to fix it.
revolution 24 Dec 2007, 01:44
My code correctly reproduces the test vectors. Perhaps you should check your code against the test vectors. If you don't do that then how can you know if my code works or not?
AlexP 24 Dec 2007, 01:52
I coded the key expansion routine prior to these posts, and it all worked fine. According to FIPS 197, the Round Constant table should have values with the most significant byte initialized. That is the way it was in my testing, and that is what it says. Any questions?
-Check with FIPS appendix A page 27.
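One possible reading of the disagreement, sketched in C: FIPS 197 defines the round constant word as the four bytes (value, 00, 00, 00) and prints words with the first byte on the left. The same four bytes in memory read back as 0x01000000 if the dword is assembled big-endian and as 0x00000001 if it is loaded with a plain little-endian mov, as on x86. That revolution's code keeps its words in little-endian order is an assumption here, not something the thread confirms. `rcon_byte`, `load_be` and `load_le` are illustrative names.

```c
#include <stdint.h>

/* Round constant byte for round i (i >= 1): x^(i-1) in GF(2^8). */
uint8_t rcon_byte(int i)
{
    uint8_t r = 1;
    while (--i > 0)
        r = (uint8_t)((r << 1) ^ ((r & 0x80) ? 0x1b : 0x00));
    return r;
}

/* Assemble four bytes with b[0] as the most significant byte (FIPS notation). */
uint32_t load_be(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Assemble four bytes with b[0] as the least significant byte (x86 mov). */
uint32_t load_le(const uint8_t b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}
```

For the bytes `{ rcon_byte(1), 0, 0, 0 }`, `load_be` returns 0x01000000 and `load_le` returns 0x00000001. Either convention can reproduce the test vectors as long as the state, key and tables all use the same one; mixing the two is what breaks.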
Copyright © 1999-2024, Tomasz Grysztar.