flat assembler
Message board for the users of flat assembler.

self-modifying code

tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 07 Oct 2009, 05:54
I'm just wondering what people's thoughts on and experiences with self-modifying code are.
MazeGen



Joined: 06 Oct 2003
Posts: 977
Location: Czechoslovakia
MazeGen 07 Oct 2009, 06:48
Don't use it if you don't have to.
sinsi



Joined: 10 Aug 2007
Posts: 794
Location: Adelaide
sinsi 07 Oct 2009, 06:54
The only time I used it was to tell the difference between an 8088 and an 8086 (by detecting the size of the prefetch queue).
Too dodgy a subject nowadays... no normal need for it (DID YOU HEAR THAT, SONY? EH?)
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20520
Location: In your JS exploiting you and your system
revolution 07 Oct 2009, 07:53
My thoughts on SMC are that it is UGLY.
My experiences with SMC are that it is UGLY.

If you want to make UGLY code that is hard to debug and maintain, then SMC may be just the perfect solution.

However, if, like most sane people, you would rather have nicer code that is easier to debug and maintain, then you should avoid SMC at all times.
Pirata Derek



Joined: 31 Oct 2008
Posts: 259
Location: Italy
Pirata Derek 07 Oct 2009, 08:29
My experience with self-modifying code has been good, though not so good at first (while I was still learning SMC).

In the past I used SMC only in some kinds of viruses (EXE infectors and boot-sector infectors) and in some programs whose source code I didn't want anyone to figure out.

You should use SMC only if you want to hide something...

Now I don't use SMC, because I don't write viruses and I don't have to hide my programs' code anymore... they are "open source" now.
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 08 Oct 2009, 00:44
So it seems the general consensus is that it is to be avoided... unless you want to write cryptic code...
Suppose we have a problem where:
1. the problem depends on a given fixed n (say between 1 and 50),
2. the algorithm that solves it involves a simple, performance-sensitive loop over n, and
3. performance would be greatly increased if this loop could be completely or partially unrolled (or something like this).
If we don't want a separate copy of the code for each of the 50 cases, and the performance gain significantly outweighs the overhead of modifying the code, wouldn't it make sense (performance-wise) to use SMC?
Or maybe there are ways of getting around this without using SMC?
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 08 Oct 2009, 00:54
tthsqe, what algorithm do you have in mind?

Note that if performance is what you are pursuing, you may actually want to have all 50 cases in RAM. That would save the setup time, which, besides the self-modification itself, may also involve cache flushing, among other things.

Tell us what you want to make self-modifiable so we can give you better advice on this topic.
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 08 Oct 2009, 03:26
Code:
format PE GUI
entry start

include 'win32a.inc'

section '.text' code readable writable executable

start:
        lea  edi,[powerfunction]  ; location to write code
        mov  ecx,5                ; exponent
        call writecode
      movaps xmm0,dqword[numbers]
        call powerfunction        ; calculates the nth power of packed singles
      invoke ExitProcess,0

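writecode:                        ; in: edi = output buffer, ecx = exponent
        push edx
        mov  edx,[.mov1c]         ; emit "movaps xmm1,dqword[const1]" (7 bytes, copied as 8)
        mov  [edi],edx
        mov  edx,[.mov1c+4]
        mov  [edi+4],edx
        add  edi,7
.1:     shr  ecx,1                ; CF = next exponent bit, ZF = no bits left
        jc   .2
        mov  edx,[.mul00]         ; bit clear: emit "mulps xmm0,xmm0" (3 bytes)
        mov  [edi],edx
        lea  edi,[edi+3]          ; mov and lea leave the flags from shr intact
        jnz  .1
        jmp  .3
.2:     mov  edx,[.mul10]         ; bit set: emit "mulps xmm1,xmm0" ...
        mov  [edi],edx
        mov  edx,[.mul00]         ; ... followed by "mulps xmm0,xmm0"
        mov  [edi+3],edx
        lea  edi,[edi+6]
        jnz  .1
.3:     mov  edx,[.mov01]         ; overwrite the last, unneeded squaring with "movaps xmm0,xmm1"
        mov  [edi-3],edx
        mov  edx,[.ret]           ; terminate the generated function with "ret"
        mov  [edi],edx
        pop  edx
        ret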
.mul10:  mulps xmm1,xmm0
         nop
.mul00:  mulps xmm0,xmm0
         nop
.mov1c:  movaps xmm1,dqword[const1]
         nop
.mov01:  movaps xmm0,xmm1
         nop
.ret:    ret
         nop
         nop
         nop

align 16
powerfunction: rept 256 {nop}

section '.data' data readable writeable

align 16                          ; movaps needs 16-byte aligned memory operands
 const1  dd 1.0,1.0,1.0,1.0
 numbers dd 1.2,-3.4,5.6,7.8

section '.idata' import data readable writable

library kernel,'KERNEL32.DLL'

import kernel,\
       ExitProcess,'ExitProcess'    

Basically, any algorithm where the code can be optimized for specific cases. The writecode function tries to perform a pseudo "superoptimization" (not a very good one in this example), and the generated code is called often enough to make up for whatever time is lost. An example could be an explorer for generalized Mandelbrot sets: z <- z^n+c. Specific cases can be optimized with an exhaustive search, and however many milliseconds this takes, the user will only notice the performance increase, with no preset limit on the range of n.
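
For readers who do not want to trace the byte copying, the decision logic of writecode can be sketched in C. This is only an illustration, not the code from the post: plan_power and run_plan are made-up names, the "generated code" is a string of step letters instead of machine code, and scalar doubles stand in for the packed singles.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Build the straight-line "program" for x^n as a string of steps:
 *   'M' = acc  *= base   (mulps xmm1,xmm0 in the post)
 *   'S' = base *= base   (mulps xmm0,xmm0 in the post)
 * Hypothetical helper; returns the number of steps written. */
size_t plan_power(unsigned n, char *out, size_t cap)
{
    size_t len = 0;

    /* At most two steps per bit of n, plus the terminating NUL. */
    assert(cap >= 2 * sizeof(unsigned) * CHAR_BIT + 1);

    do {
        if (n & 1u)
            out[len++] = 'M';   /* bit set: multiply the accumulator in */
        out[len++] = 'S';       /* always square the base for the next bit */
        n >>= 1;
    } while (n != 0);

    len--;                      /* the final squaring is never used: drop it */
    out[len] = '\0';
    return len;
}

/* Execute a plan on a scalar; the accumulator starts at 1.0 like const1. */
double run_plan(const char *plan, double x)
{
    double acc = 1.0, base = x;

    for (; *plan != '\0'; ++plan) {
        if (*plan == 'M')
            acc *= base;
        else
            base *= base;
    }
    return acc;
}
```

For n = 5 the plan comes out as "MSSM", i.e. mulps xmm1,xmm0 / mulps xmm0,xmm0 / mulps xmm0,xmm0 / mulps xmm1,xmm0, which is the sequence writecode emits between the initial load of 1.0 and the final movaps xmm0,xmm1.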
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 08 Oct 2009, 03:32
How bad are the cache flushes and the other costs? Is the cache flushed EVERY time a write instruction is issued?
sinsi



Joined: 10 Aug 2007
Posts: 794
Location: Adelaide
sinsi 08 Oct 2009, 03:47
Any write to a code segment will invalidate the cache lines, but it seems to be worse if it's in the same 1k sub-page and even worse if the code has been decoded.
Intel use the phrase "significantly degraded performance" quite a lot when talking about SMC.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 08 Oct 2009, 05:14
sinsi, perhaps this is not an issue anymore, but AFAIK on SMP systems you had better call FlushInstructionCache (Windows API), because the other CPUs/cores may not follow the change (all of them should follow it in their data caches, though). Also note that you should do this even in single-threaded applications, as they won't stay on a single CPU/core unless you explicitly request that with an affinity mask.

tthsqe, well, if you promise that the SMC will be executed often enough to hide the cost of "compiling" it, then I think it is OK, but you must make sure that is actually true: the clock cycles saved by using specialized code may not be enough to compensate for the SMC overhead. Also check whether you can have all the cases compiled ahead of time and then, after the exhaustive search, store a pointer to the one that should be called. (Sorry, I'm not sure what you are doing with the Mandelbrot thing here.)
sinsi



Joined: 10 Aug 2007
Posts: 794
Location: Adelaide
sinsi 08 Oct 2009, 06:37
loco: I was talking about the CPU level, not the Windows level. You would have to be crazy (loco?) to use SMC in Windows (or any threaded OS, I guess).
These days it's better to 'bloat' your code with a few functions that might not get used, considering the amount of RAM around.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 08 Oct 2009, 07:20
Quote:

crazy (loco?)
Yes!

I mentioned the API because it takes care of invalidating the caches of the CPUs/cores other than the one running the thread (CLFLUSH signals the other CPUs/cores, making the API unnecessary on modern PCs?).

As I've said above, I would go for the method of having all the cases precompiled and a pointer, set at run time, pointing to the selected implementation. However, in some cases this may not be feasible, for instance if you want to replace a memory load in the generalized algorithm with an immediate operand and there are far too many possible values to have a separate implementation for each.
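
The "precompiled cases plus run-time pointer" idea can be sketched in C roughly as follows. This is a minimal illustration under assumptions: pow_fn, pow_table, select_pow and the six hand-written cases are made up for the example, and scalar floats stand in for the packed singles.

```c
#include <assert.h>
#include <stddef.h>

/* Every specialization has the same signature, so one pointer can hold any of them. */
typedef float (*pow_fn)(float);

static float pow0(float x) { (void)x; return 1.0f; }
static float pow1(float x) { return x; }
static float pow2(float x) { return x * x; }
static float pow3(float x) { return x * x * x; }
static float pow4(float x) { float s = x * x; return s * s; }
static float pow5(float x) { float s = x * x; return s * s * x; }

/* One precompiled, branch-free body per exponent (only 0..5 in this sketch). */
static const pow_fn pow_table[] = { pow0, pow1, pow2, pow3, pow4, pow5 };

#define MAX_POW (sizeof pow_table / sizeof pow_table[0] - 1)

/* Pick the implementation once, outside the hot loop. */
pow_fn select_pow(unsigned n)
{
    assert(n <= MAX_POW);   /* no generic fallback in this sketch */
    return pow_table[n];
}
```

The table lookup happens once, when n is chosen; the inner loop then calls through the stored pointer (for example pow_fn f = select_pow(5); followed by many calls to f(x)) with no per-call branching on n and no writes to code memory.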
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 08 Oct 2009, 21:29
LocoDelAssembly,
The performance gains are real.
The unrolled version:
Code:
       movaps xmm0,dqword[numbers]
       movaps xmm1,dqword[const1]
        mulps xmm1,xmm0
        mulps xmm0,xmm0
        mulps xmm0,xmm0
        mulps xmm1,xmm0
        mulps xmm0,xmm0
        mulps xmm0,xmm0
        mulps xmm1,xmm0
       movaps xmm0,xmm1    

executes almost twice as fast as:
Code:
         mov  ecx,21
       movaps xmm0,dqword[numbers]
       movaps xmm1,dqword[const1]
.1:      shr  ecx,1
         jc   .2
        mulps xmm0,xmm0
         jnz  .1
         jmp  .3
.2:     mulps xmm1,xmm0
        mulps xmm0,xmm0
         jnz  .1
.3:    movaps xmm0,xmm1    

on the P4 I'm at now. Do you know of any estimate for the penalty incurred by something like the example I posted above?
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 08 Oct 2009, 22:07
Yes, on the P4 (and I hope not on the Core i7 too, since it reintroduced HT), removing unpredictable branches is indeed a good idea.

As for the penalty, I don't know, but it may be hundreds of times the time the unrolled version takes. It is important to make sure you actually execute the SMC code enough times to make it worthwhile. Note that I mean the case where you have to call OS services to clear the cache; otherwise it is a lot cheaper (but still not worth it if you call the SMC code only a few times).
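
To put the "enough times" condition into numbers: generating code costs a one-off setup, and each call then saves the difference between the generic and the specialized version. A minimal sketch in C, where break_even_calls is a made-up helper and all costs are in whatever unit you measured (clock cycles, say):

```c
#include <assert.h>

/* Smallest number of calls N for which
 *     setup_cost + N * specialized_cost <= N * generic_cost.
 * Returns 0 if the specialized version is not faster, i.e. it never pays off.
 * Hypothetical helper for illustration only. */
unsigned long break_even_calls(unsigned long setup_cost,
                               unsigned long generic_cost,
                               unsigned long specialized_cost)
{
    unsigned long saving;

    assert(setup_cost > 0);            /* generating code is never free */

    if (specialized_cost >= generic_cost)
        return 0;                      /* no saving per call */

    saving = generic_cost - specialized_cost;
    return (setup_cost + saving - 1) / saving;   /* ceil(setup_cost / saving) */
}
```

With a purely illustrative setup cost of 10000 cycles and a saving of 20 cycles per call (40 versus 20), the generated code has to be called 500 times before it breaks even. These numbers are invented for the example, not measurements.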

In the example you give, however, I think you should have all the specialized versions compiled, as ECX cannot be very large here. Also note that, since the pow code would sit together with the rest of the code, you may save the processor from having to spend an extra TLB entry on the memory page where the specialized version resides.
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 11 Oct 2009, 02:43
Just to clarify a bit, this is more or less what I mean by having all the versions precompiled:
Code:
MAX_POW equ 128

format pe console
include 'win32ax.inc'

      sub      esp, 16
      and      esp, -16
      xor      ebx, ebx

.loop:
      mov      eax, [.const2]
      mov      edx, [.const2+4]
      mov      [esp], eax
      mov      [esp+4], edx
      mov      eax, [.const2+8]
      mov      edx, [.const2+12]
      mov      [esp+8], eax
      mov      [esp+12], edx

      mov      eax, esp
      mov      ecx, ebx
      call     pow

      cvtss2sd xmm0, [esp]
      movaps   [esp], xmm0
      mov      eax, esp

      cinvoke  printf, <"%f", 10>, dword [eax], dword [eax+4]

      inc      ebx
      cmp      ebx, MAX_POW
      jbe       .loop

      cinvoke system, "pause"
  invoke  ExitProcess, 0

.const2 dd 2.0, 2.0, 2.0, 2.0

;;;;;;
;;;;;; POW code (There should be MAX_POW versions of your loop actually with their respective pow cases inlined)
;;;;;;
align 16
pow.const1 dd 1.0, 1.0, 1.0, 1.0
pow: ; Assumes valid numbers (i.e. no NaNs, INFs, etc.)
     ; Assumes ECX = [0..MAX_POW]

      movaps   xmm0, dqword [.const1]
      movaps   xmm1, [eax]

      jmp      [.powTable+ecx*4]

align 16
.pow0: ; Is it OK for 0^0?
      movaps   [eax], xmm0

      ret

align 16
.pow1:
      ret

rept (MAX_POW) - 1 power:2
{
common
  local i

forward
local ..powX

  i = power

align 16
..powX:

  while i > 0

    if i mod 2

      mulps   xmm0, xmm1
    end if
    if i > 1

      mulps   xmm1, xmm1
    end if

    i = i / 2
  end while
      movaps  [eax], xmm0

      ret
common

align 64
.powTable dd .pow0, .pow1

forward
  dd ..powX
}

align 4 ; Just to be safe
data import 
  library kernel32, 'kernel32.dll',\
          msvcrt,'msvcrt.dll'

  import kernel32,\
         ExitProcess, 'ExitProcess'

  import msvcrt,\
         printf, 'printf',\
         system, 'system'
end data    
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
kohlrak 11 Oct 2009, 03:23
Not really self-modifying code, but a self-modifying executable.

It's somewhat useful if you like entirely self-contained programs.

Code:
include '/fasminc/unistd64.inc' ;dig for this in the macros section
format ELF64 executable at 0x100000000
entry $
        pop rax ;number of params, which we won't need unless we want to error check (pointless as this is just a tut)
        pop r13 ;program path, which we'll need later
        pop r12 ;current message

        syscalls sys_write, STDOUT, msg1, msg2-msg1
        syscalls sys_write, STDOUT, storage, DWORD [oldmsgsize]
        syscalls sys_write, STDOUT, msg2, oldmsgsize-msg2
        syscalls sys_unlink, r13 ;Don't know why, but you can only write to a running process if (a) you are not the process or (b) you delete yourself first
        syscalls sys_open, r13, O_RDWR or O_CREAT, 448 ;The last param's official name is so ugly you're more likely to remember the number than it

        mov ebx, eax
        mov rsi, r12
        lea rdi, [storage]
@@:     lodsb
        stosb
        or al, al
        jnz @b
        sub rsi, r12
        mov r14, rsi
        mov [oldmsgsize], esi

        syscalls sys_write, ebx, 0x100000000, endprog-0x100000000
        syscalls sys_write, STDOUT, r12, r14
        syscalls sys_write, STDOUT, newline, 1
        syscalls sys_close, ebx
        syscalls sys_exit
;=================================================================================================================================
storage rb 32
msg1 db "Old message was: "
msg2 db 10, "Current message is: "
oldmsgsize dd 0
newline db 10
;=================================================================================================================================
endprog:    
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 21 Oct 2009, 07:47
IMHO, SMC (as in a function that patches itself on every entry) usually isn't good these days, because of the penalties already mentioned. But runtime-generated code (generate once, execute many) can work pretty well.

Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
