flat assembler
Message board for the users of flat assembler.
tthsqe 07 Oct 2009, 05:54
I'm just wondering what people's thoughts/experiences are on self-modifying code.
MazeGen 07 Oct 2009, 06:48
Don't use it if you don't have to.
revolution 07 Oct 2009, 07:53
My thoughts on SMC are that it is UGLY.
My experiences with SMC are that it is UGLY. If you want to make UGLY code that is hard to debug and maintain, then SMC may be just the perfect solution. However, if, like most sane people, you would rather have nicer code that is easier to debug and maintain, then you should avoid SMC at all times.
Pirata Derek 07 Oct 2009, 08:29
My experience with self-modifying code was good, though not so good at first (while I was learning SMC).
In the past I used SMC only in some kinds of viruses (EXE infectors and boot-sector infectors) and in some programs whose source code I didn't want anyone to know. You should use SMC only if you want to hide something... Now I don't use SMC because I don't write viruses and I don't have to hide my programs' code anymore... they are "open-source" now.
tthsqe 08 Oct 2009, 00:44
So it seems that the general consensus is that it is to be avoided... unless you want to make cryptic code...
Suppose we have a problem where:
1. the problem depends on a given fixed n (say between 1 and 50);
2. the algorithm to solve it involves a performance-sensitive simple loop over n;
3. performance would be greatly increased if this loop could be completely or partially unrolled (or something like this).
If we don't want a specific copy of the code for each of the 50 cases, and the performance gain significantly outweighs the overhead of modifying the code, wouldn't it make sense (performance-wise) to use SMC? Or maybe there are ways of getting around this without using SMC?
LocoDelAssembly 08 Oct 2009, 00:54
tthsqe, what algorithm do you have in mind?
Note that if performance is what you are pursuing, you may actually want to have the 50 cases in RAM. That would save the setup time, which, besides the self-modification itself, may also require cache flushing among other things. Tell us what you want to make self-modifiable so we can give you better hints on this topic.
tthsqe 08 Oct 2009, 03:26
Code:
    format PE GUI
    entry start
    include 'win32a.inc'

    section '.text' code readable writeable executable

    start:
            lea     edi,[powerfunction]     ; location to write code
            mov     ecx,5                   ; exponent
            call    writecode
            movaps  xmm0,dqword[numbers]
            call    powerfunction           ; calculates the nth power of packed singles
            invoke  ExitProcess,0

    ; in: edi = destination buffer, ecx = exponent
    writecode:
            push    edx
            mov     edx,[.mov1c]            ; emit "movaps xmm1,[const1]" (7 bytes)
            mov     [edi],edx
            mov     edx,[.mov1c+4]
            mov     [edi+4],edx
            add     edi,7
    .1:     shr     ecx,1                   ; CF = current bit, ZF = no bits left
            jc      .2
            mov     edx,[.mul00]            ; bit clear: square only
            mov     [edi],edx
            lea     edi,[edi+3]             ; mov/lea leave the flags from shr intact
            jnz     .1
            jmp     .3
    .2:     mov     edx,[.mul10]            ; bit set: multiply accumulator, then square
            mov     [edi],edx
            mov     edx,[.mul00]
            mov     [edi+3],edx
            lea     edi,[edi+6]
            jnz     .1
    .3:     mov     edx,[.mov01]            ; overwrite the last (useless) squaring
            mov     [edi-3],edx
            mov     edx,[.ret]
            mov     [edi],edx
            pop     edx
            ret

    ; instruction templates, each padded with nop so 4-byte copies are safe
    .mul10: mulps   xmm1,xmm0
            nop
    .mul00: mulps   xmm0,xmm0
            nop
    .mov1c: movaps  xmm1,dqword[const1]
            nop
    .mov01: movaps  xmm0,xmm1
            nop
    .ret:   ret
            nop
            nop
            nop

    align 16
    powerfunction:
            rept 256 {nop}

    section '.data' data readable writeable
    align 16                                ; movaps needs 16-byte aligned operands
    const1  dd 1.0,1.0,1.0,1.0
    numbers dd 1.2,-3.4,5.6,7.8

    section '.idata' import data readable writeable
    library kernel,'KERNEL32.DLL'
    import kernel,\
           ExitProcess,'ExitProcess'

Basically, any algorithm where the code can be optimized for specific cases. The writecode function tries to perform a pseudo "superoptimization" (not very good in this example), and the generated code is called enough times to make up for whatever time is lost. An example could be an explorer for generalized Mandelbrot sets: z <- z^n+c. Specific cases can be optimized with an exhaustive search, and however many milliseconds this takes, the user will only notice the performance increase, with no preset limit on the range of n.
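Editor's sketch, not part of the original post: the planning step inside writecode can be modelled in C without writing any machine code. The names `plan_power`, `run_plan`, `OP_MUL_ACC` and `OP_SQUARE` are hypothetical. The plan is a string of one-letter opcodes standing in for the two `mulps` forms, and it assumes the same lowest-bit-first scan as the post, with the final squaring left out instead of overwritten.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes standing in for the two SSE instructions. */
enum {
    OP_MUL_ACC = 'M',   /* mulps xmm1,xmm0 : acc  *= base */
    OP_SQUARE  = 'S'    /* mulps xmm0,xmm0 : base *= base */
};

/* Plan the straight-line sequence for x^n, scanning the exponent from its
   lowest bit. Writes a NUL-terminated opcode string into out (capacity cap)
   and returns its length. No squaring is emitted after the top bit. */
size_t plan_power(unsigned n, char *out, size_t cap)
{
    size_t len = 0;
    while (n != 0) {
        if (n & 1u) {
            assert(len + 1 < cap);
            out[len++] = OP_MUL_ACC;
        }
        n >>= 1;
        if (n != 0) {
            assert(len + 1 < cap);
            out[len++] = OP_SQUARE;
        }
    }
    assert(len < cap);
    out[len] = '\0';
    return len;
}

/* Interpret a plan on one scalar, mirroring what the generated code would
   do to each packed single. */
float run_plan(const char *plan, float x)
{
    float acc = 1.0f, base = x;
    for (; *plan; ++plan) {
        if (*plan == OP_MUL_ACC)
            acc *= base;
        else
            base *= base;
    }
    return acc;
}
```

For an exponent of 5 this yields the plan M, S, S, M, which is the same order of multiplications the assembly version writes into powerfunction.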
tthsqe 08 Oct 2009, 03:32
How bad are the cache flush and other things? Is it flushed EVERY time a write instruction is issued?
sinsi 08 Oct 2009, 03:47
Any write to a code segment will invalidate the cache lines, but it seems to be worse if it's in the same 1k sub-page and even worse if the code has been decoded.
Intel use the phrase "significantly degraded performance" quite a lot when talking about SMC.
LocoDelAssembly 08 Oct 2009, 05:14
sinsi, perhaps it's not an issue anymore, but AFAIK on SMP systems you'd better call FlushInstructionCache (Windows API), because the other CPUs/cores may not follow the change (all of them should follow it in their data caches, though). Also note that you should do this even in single-threaded applications, as they won't stay on a single CPU/core unless you explicitly say so with affinity masks.
tthsqe, well, if you promise that the SMC will be executed enough to hide the cost of "compiling" it, then I think it is OK, but you must make sure that is actually true: the clock cycles saved by using specialized code may not be enough to compensate for the SMC overhead. Also check whether you can have all the cases compiled and then, after the exhaustive search, store a pointer to the one that should be called. (Sorry, not sure what you are doing with the Mandelbrot thing here.)
sinsi 08 Oct 2009, 06:37
loco: I was talking about the CPU level, not the Windows level. You would have to be crazy (loco?) to use SMC in Windows (or any threaded OS, I guess).
These days it's better to 'bloat' your code with a few functions that might not get used, considering the amount of RAM around.
LocoDelAssembly 08 Oct 2009, 07:20
I mentioned the API because it takes care of invalidating the caches of the CPUs/cores other than the one running the thread (CLFLUSH signals the other CPUs/cores, making the API unnecessary on modern PCs?). As I've said above, I would go for the method of having all the cases pre-compiled and then having a pointer, set at run time, pointing to the selected implementation. However, in some cases it may not be feasible, for instance if you want to replace a memory load in the generalized algorithm with an immediate operand and there are far too many possible values to have a separate implementation for each.
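Editor's sketch, not part of the original post: the "pre-compiled cases plus a run-time pointer" idea, written in C for brevity. All names here (`pow_fn`, `pow_table`, `select_pow`, `pow_array`) are hypothetical, and only five variants are shown where a real table would carry one entry per supported n.

```c
#include <assert.h>
#include <stddef.h>

typedef float (*pow_fn)(float);

/* A few hand-specialised variants; a real table would have one per n. */
static float pow0(float x) { (void)x; return 1.0f; }
static float pow1(float x) { return x; }
static float pow2(float x) { return x * x; }
static float pow3(float x) { return x * x * x; }
static float pow4(float x) { float s = x * x; return s * s; }

static const pow_fn pow_table[] = { pow0, pow1, pow2, pow3, pow4 };
enum { MAX_POW = sizeof pow_table / sizeof pow_table[0] - 1 };

/* Pick the implementation once, outside the hot loop. */
pow_fn select_pow(unsigned n)
{
    assert(n <= (unsigned)MAX_POW);
    return pow_table[n];
}

/* Hot loop: one indirect call per element, no branching on n. */
void pow_array(float *v, size_t count, unsigned n)
{
    pow_fn f = select_pow(n);
    for (size_t i = 0; i < count; ++i)
        v[i] = f(v[i]);
}
```

Nothing is written to executable memory here, so no instruction-cache flushing is involved; the price is the size of the table and of the variants it points to.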
tthsqe 08 Oct 2009, 21:29
LocoDelAssembly,
The performance gains are real. The unrolled version:
Code:
            movaps  xmm0,dqword[numbers]
            movaps  xmm1,dqword[const1]
            mulps   xmm1,xmm0
            mulps   xmm0,xmm0
            mulps   xmm0,xmm0
            mulps   xmm1,xmm0
            mulps   xmm0,xmm0
            mulps   xmm0,xmm0
            mulps   xmm1,xmm0
            movaps  xmm0,xmm1

executes almost twice as fast as:
Code:
            mov     ecx,21
            movaps  xmm0,dqword[numbers]
            movaps  xmm1,dqword[const1]
    .1:     shr     ecx,1
            jc      .2
            mulps   xmm0,xmm0
            jnz     .1
            jmp     .3
    .2:     mulps   xmm1,xmm0
            mulps   xmm0,xmm0
            jnz     .1
    .3:     movaps  xmm0,xmm1

on the P4 I'm at now. Do you know of any estimate for the penalty incurred with something like the example I posted above?
LocoDelAssembly 08 Oct 2009, 22:07
Yes, under P4 (and I hope not Core i7 too, as they reintroduced HT), removing unpredictable branches is indeed a good idea.
About the penalty, I don't know, but it may be hundreds of times the time the unrolled version takes. It is important to make sure you actually execute the SMC code enough times to make it worthwhile. Note that I mean the case of having to call the OS services to clear the cache; otherwise it is a lot cheaper (but still worthless if you'll call the SMC code only a few times). In the example you mention, however, I think you should have all the specialized versions compiled, as ECX cannot be very large here. Also note that since the pow code would be together with the rest of the code, you may save the processor from having to spend an extra TLB slot to remember the memory page where the specialized version resides.
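Editor's sketch, not part of the original post: the "executed enough times" condition is plain arithmetic and can be written down in C. The function name `break_even_calls` and its parameters are hypothetical; the three costs must be measured by the reader in one common unit (clock cycles, for instance), and nothing in the thread supplies actual figures for them.

```c
/* Smallest number of calls at which generating specialised code pays off.
   All three inputs are in the same unit and are measurements the caller
   must supply. Returns 0 if specialisation never pays off. */
unsigned long long break_even_calls(unsigned long long setup_cost,
                                    unsigned long long generic_cost,
                                    unsigned long long specialised_cost)
{
    if (specialised_cost >= generic_cost)
        return 0;                       /* no per-call saving at all */

    unsigned long long saving = generic_cost - specialised_cost;

    /* Need calls * saving > setup_cost, so calls = floor(setup/saving) + 1. */
    return setup_cost / saving + 1;
}
```

As a made-up illustration, a setup cost of 10000 units against a saving of 20 units per call needs 501 calls before the specialised version comes out ahead.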
LocoDelAssembly 11 Oct 2009, 02:43
Just to clarify a bit, this is more or less what I mean by having all the versions precompiled:
Code:
    MAX_POW equ 128

    format pe console
    include 'win32ax.inc'

            sub     esp, 16
            and     esp, -16
            xor     ebx, ebx

    .loop:
            mov     eax, [.const2]
            mov     edx, [.const2+4]
            mov     [esp], eax
            mov     [esp+4], edx
            mov     eax, [.const2+8]
            mov     edx, [.const2+12]
            mov     [esp+8], eax
            mov     [esp+12], edx

            mov     eax, esp
            mov     ecx, ebx
            call    pow

            cvtss2sd xmm0, [esp]
            movaps  [esp], xmm0
            mov     eax, esp
            cinvoke printf, <"%f", 10>, dword [eax], dword [eax+4]

            inc     ebx
            cmp     ebx, MAX_POW
            jbe     .loop

            cinvoke system, "pause"
            invoke  ExitProcess, 0

    .const2 dd 2.0, 2.0, 2.0, 2.0

    ;;;;;;
    ;;;;;; POW code (There should be MAX_POW versions of your loop actually with their respective pow cases inlined)
    ;;;;;;
    align 16
    pow.const1 dd 1.0, 1.0, 1.0, 1.0
    pow:
    ; Assumes valid numbers (i.e. no NaNs, INFs, etc.)
    ; Assumes ECX = [0..MAX_POW]
            movaps  xmm0, dqword [.const1]
            movaps  xmm1, [eax]
            jmp     [.powTable+ecx*4]

    align 16
    .pow0: ; Is it OK for 0^0?
            movaps  [eax], xmm0
            ret

    align 16
    .pow1:
            ret

    rept (MAX_POW) - 1 power:2
    {
      common
        local i

      forward
        local ..powX

        i = power
        align 16
        ..powX:
        while i > 0
          if i mod 2
            mulps xmm0, xmm1
          end if
          if i > 1
            mulps xmm1, xmm1
          end if
          i = i / 2
        end while
        movaps [eax], xmm0
        ret

      common
        align 64
        .powTable dd .pow0, .pow1
      forward
        dd ..powX
    }

    align 4 ; Just to be safe
    data import
      library kernel32, 'kernel32.dll',\
              msvcrt,'msvcrt.dll'

      import kernel32,\
             ExitProcess, 'ExitProcess'

      import msvcrt,\
             printf, 'printf',\
             system, 'system'
    end data
kohlrak 11 Oct 2009, 03:23
Not really self-modifying code, but self-modifying executable.
It's somewhat useful if you like entirely self-contained programs.
Code:
    include '/fasminc/unistd64.inc' ;dig for this in the macros section
    format ELF64 executable at 0x100000000
    entry $
            pop     rax     ;number of params, which we won't need unless we want to error check (pointless as this is just a tut)
            pop     r13     ;program path, which we'll need later
            pop     r12     ;current message
            syscalls sys_write, STDOUT, msg1, msg2-msg1
            syscalls sys_write, STDOUT, storage, DWORD [oldmsgsize]
            syscalls sys_write, STDOUT, msg2, oldmsgsize-msg2
            syscalls sys_unlink, r13 ;Don't know why, but you can only write to a running process if (a) you are not the process or (b) you delete yourself first
            syscalls sys_open, r13, O_RDWR or O_CREAT, 448 ;The last param's official name is so ugly you're more likely to remember the number than it
            mov     ebx, eax
            mov     rsi, r12
            lea     rdi, [storage]
    @@:     lodsb
            stosb
            or      al, al
            jnz     @b
            sub     rsi, r12
            mov     r14, rsi
            mov     [oldmsgsize], esi
            syscalls sys_write, ebx, 0x100000000, endprog-0x100000000
            syscalls sys_write, STDOUT, r12, r14
            syscalls sys_write, STDOUT, newline, 1
            syscalls sys_close, ebx
            syscalls sys_exit
    ;==========================================================================
    storage rb 32
    msg1 db "Old message was: "
    msg2 db 10, "Current message is: "
    oldmsgsize dd 0
    newline db 10
    ;==========================================================================
    endprog:
f0dder 21 Oct 2009, 07:47
IMHO SMC (as in a function that patches itself on every entry) usually isn't good these days, because of the penalties already mentioned. But runtime-generated code (generate once, execute many) can work pretty well.