Rambo - First blood, part II:
I've always believed that the mind's the best weapon.

I will show a few methods for defeating a top compressor using the best weapon: our mind.
I won't present a universal compressor like UPX.
The compressor presented here is suitable only for FASM-produced PE32+ executables, because it requires a special modification of the import section (which in fact no longer contains imports, only pointers to imports in the data section). You have to adapt its sources if you want to compress anything other than FASM-produced win64 executables.

----------

Reducing the size of executables.

With the move to win64, the good old DOS times are gone. Today there is no way to make programs like DOS *.com executables. Under Windows we usually waste a lot of bytes on the exe header and on the nulls that pad sections to the default file alignment of 200h.

***
1.*
***
If a PE32+ executable has at least 4 sections, the default FASM-compiled exe file has a header of 400h bytes.
By using a smaller DOS stub we can reduce the header size to 200h bytes for 4 sections (4 sections are usually enough for most purposes: code, data, resource, imports).
; --- hdr.asm file ---
format MZ
; --- that's all, just compile it to produce hdr.exe with size of 32 bytes ---

; sample of win64 asm file begin:
format PE64 GUI at 100000000h on 'hdr.exe'

***
2.*
***
The most useful tricks for reducing the code section are:
zeroing rax, rbx, rcx, rdx, rbp, rsi, rdi:
xor rax,rax uses 3 bytes (1 byte of REX prefix and 2 bytes of instruction)
AMD64 lets us exploit the 64-bit zero extension of operations with a 32-bit destination register:
xor eax,eax reliably zeroes the whole rax register with a 2-byte instruction
This trick doesn't help for the r8-r15 registers: xoring r8d takes 3 bytes, the same as xoring r8.
If you know that one of your registers is zero, there is a trick to put a value between -80h and +7Fh into another register:
e.g.
xor ebx,ebx ; rbx=0
lea ecx,[rbx+2] ; rcx=2
lea edx,[rbx-5] ; edx=-5
both lea instructions have 3 bytes and are the smallest possibility (lea rcx,[rbx+2] needs a REX prefix and lea ecx,[ebx+2] needs a 67h address-size prefix, so each of them has 4 bytes)
another way to do it in 3 bytes is
push 2 ; 2 bytes
pop rcx ; 1 byte
push -5 ; 2 bytes
pop rdx ; 1 byte

Note the difference: lea edx,[rbx-5] gives rdx=00000000FFFFFFFB, while push -5 / pop rdx gives rdx=FFFFFFFFFFFFFFFB
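The difference can be checked with a small model. This is only a Python sketch of the register arithmetic (the helper names write32 and write64 are made up for illustration), assuming a write to a 32-bit register zeroes the upper half and that push imm8 sign-extends to 64 bits:

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def write32(value):
    """Model a write to a 32-bit register: the upper dword of the 64-bit register becomes zero."""
    return value & MASK32

def write64(value):
    """Model a write to a full 64-bit register."""
    return value & MASK64

rbx = 0
rdx_lea = write32(rbx - 5)   # lea edx,[rbx-5]
rdx_pop = write64(-5)        # push -5 / pop rdx

print(f"{rdx_lea:016X}")
print(f"{rdx_pop:016X}")
```

So the lea form is only safe when the upper dword of the result doesn't matter or the value is not negative.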

A trick for putting -1 into a 32-bit or 64-bit register:
or ecx,-1 ; 3 bytes, ecx=FFFFFFFF rcx=00000000FFFFFFFF
or rdx,-1 ; 4 bytes, rdx=FFFFFFFFFFFFFFFF

Sometimes we can replace mov with the smaller xchg (it costs a performance penalty and destroys the source register)
mov rbx,rax ; 3 bytes
xchg rbx,rax ; 2 bytes

As a general rule, try to use 64-bit registers inside the brackets (addressing), e.g. [rax+3], and 32-bit registers outside the brackets, e.g. mov eax,[...].
Avoid unnecessary prefix bytes (REX prefix - bytes 40-4F for 64-bit registers, 66 prefix for 16-bit registers, 67 prefix for 32-bit addressing).
But keep in mind that you must not omit prefixes when they are essential!
Just imagine this mistake: you want to put -7 into RDX and you know that RAX=0
The 3-byte instruction lea edx,[rax-7] is a mistake because -7 is put into EDX and the upper dword of RDX is zeroed, thus the result in RDX is 00000000FFFFFFF9
the correct possibility is lea rdx,[rax-7], which takes 4 bytes (push -7 / pop rdx is correct too and takes only 3 bytes, but is slower)
mov rdx,-7 is terribly huge, it uses 7 bytes (REX prefix, opcode, ModRM and a 32-bit immediate)

the smallest routines for finding the size of a 0-terminated string
size with 0-terminator
lea rdi,[string]
or ecx,-1 ; 3 bytes (PE32+ limits exe size to 2 GB, thus FFFFFFFFh is enough for every string)
xor eax,eax ; 2 bytes
cld ; 1 byte
repnz scasb ; 2 bytes
not ecx ; 2 bytes
; ecx=size

size without 0-terminator
lea rdi,[string]
or ecx,-1 ; 3 bytes, ecx=FFFFFFFFh
xor eax,eax ; 2 bytes
lea edx,[rcx-1] ; 3 bytes, edx=FFFFFFFEh
cld ; 1 byte
repnz scasb ; 2 bytes
sub edx,ecx ; 2 bytes
; edx=size
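
Both routines rely on the same counter arithmetic. A minimal Python model of it (the function name scan_lengths is illustrative, and the buffer is assumed to contain a 0-terminator):

```python
def scan_lengths(buf):
    """Model or ecx,-1 / repnz scasb and both ways of turning the counter into a length."""
    ecx = 0xFFFFFFFF                       # or ecx,-1
    edx = (ecx - 1) & 0xFFFFFFFF           # lea edx,[rcx-1]
    for byte in buf:                       # repnz scasb with al=0
        ecx = (ecx - 1) & 0xFFFFFFFF       # the counter drops once per scanned byte
        if byte == 0:
            break
    with_terminator = ~ecx & 0xFFFFFFFF            # not ecx
    without_terminator = (edx - ecx) & 0xFFFFFFFF  # sub edx,ecx
    return with_terminator, without_terminator

print(scan_lengths(b"abc\x00"))
```

After n scanned bytes ecx holds FFFFFFFFh-n, so not ecx gives n and FFFFFFFEh-ecx gives n-1.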

***
3.*
***
Removing [ExitProcess]
If your imports section exceeds the default 200h file alignment by only a few bytes, you can remove this API and save 200h bytes of final exe size

start:
sub rsp,8*(4+11)
... code
call qword [ExitProcess]

this is the replacement:
sub rsp,8*(4+11)
... code
add rsp,8*(4+11)
xor eax,eax
ret

Why is it possible ?
kernel32.dll:
0000000078D59630 48894C2408		BaseProcessStart: mov [rsp+08],rcx ; put address of exe entrypoint into the stack
0000000078D59635 4883EC28		sub rsp,28
0000000078D59639 41B908000000		mov r9d,00000008
0000000078D5963F 4C8D442430		lea r8,[rsp+30]
0000000078D59644 418D5101		lea edx,[r9+01]
0000000078D59648 48B9FEFFFFFFFFFFFFFF	mov rcx,FFFFFFFFFFFFFFFE ; -2 = pseudo-handle of the current thread
0000000078D59652 FF151080FEFF		call qword [0000000078D41668] ; []=0000000078EF1330=ntdll.NtSetInformationThread
0000000078D59658 FF542430		call qword [rsp+30] ; call the entrypoint
0000000078D5965C 8BC8			mov ecx,eax ; here we return from executable
0000000078D5965E E8BDDB0000		call 0000000078D67220 ; KERNEL32.ExitThread
0000000078D59663 CC			int3

By the way, about subtracting/adding values from/to rsp: if the value is smaller than 80h, the instruction is relatively small (4 bytes), but if you need 80h or more, the instruction is terribly huge (7 bytes).
If you want to add 80h, the instruction add rsp,80h is huge, but it may be replaced with sub rsp,-80h, which is smaller because -80h still fits into a signed byte.
Please avoid the great mistake of encoding it with esp instead of rsp (to remove the REX prefix). You can't be sure that the stack isn't allocated above address 0000000100000000h
e.g. rsp=0000000100001568h
sub esp,8*(4+1)
rsp=0000000000001540h and you cause an access violation when accessing the stack.
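
Both points can be checked with a few lines of Python. This is just a sketch: the helper names are made up, and the size formula assumes the usual encoding REX.W + opcode + ModRM followed by an 8-bit or 32-bit immediate:

```python
def fits_imm8(value):
    """True if the immediate fits into a sign-extended byte."""
    return -0x80 <= value <= 0x7F

def rsp_adjust_size(value):
    """Size in bytes of add rsp,value or sub rsp,value."""
    return 3 + (1 if fits_imm8(value) else 4)

def sub_esp(rsp, value):
    """Model sub esp,value: the result is truncated to 32 bits and zero-extended into rsp."""
    return (rsp - value) & 0xFFFFFFFF

print(rsp_adjust_size(0x80), rsp_adjust_size(-0x80))
print(f"{sub_esp(0x0000000100001568, 8 * (4 + 1)):016X}")
```

The first line shows why sub rsp,-80h beats add rsp,80h, the second one reproduces the broken stack pointer from the example above.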

***
4.*
***
Putting sections together
If you look into the code and data sections with a hex viewer and see that a big tail of code and data is padded with nulls, you can try to merge both sections and you may save 200h bytes of padding.
Then make the whole section executable+readable+writeable. If you forget to make it executable, you get an access violation when executing the first instruction. If you forget to make it writeable, you get an access violation on the first write into the section.

***
5.*
***
Removing imports section and doing imports manually.
Viruses usually use this heavy, crazy way, but if you need only e.g. 3 imports (CreateFileA, WriteFile, CloseHandle - ExitProcess isn't necessary), then why waste 200h bytes on an almost empty import section? We can do it manually in fewer bytes in the code section!
Unfortunately, I discovered limits under Vista RC2 (but the problems may disappear in the RTM version of Vista...)
So, this method works well on XP x64 and W2K3 x64. Ugly and hardly fixable facts under Vista RC2:
Vista replaces some kernel32.dll APIs with others from ntdll.dll, e.g. kernel32.ExitProcess -> ntdll.RtlExitUserProcess - I found about 100 such APIs in kernel32.dll. GetProcAddress returns the ntdll APIs automatically, but the manual way returns a pointer to an ASCII string with the API name (e.g. ntdll.RtlExitUserProcess) instead of the address of a procedure (there is no procedure; the ANSI string naming the ntdll replacement resides at the address of the API instead of instructions).
This is good for making malware's life complicated, but it complicates the life of asm coders too.
The routines for doing imports manually are much smaller than the smallest import section of 200h bytes, but again, this is very unsuitable for Vista x64.
OK, this is a sample:
	push	60h
	pop	rdx			; rdx=60h
	gs mov	rcx,qword [rdx]		; Peb
	mov	rax,qword [rcx+8*3]	; PebLdr
	mov	rsi,qword [rax+8*6]	; Ldr.InInitializationOrderModuleList
	cld
	lodsq				; rsi+8 skip ntdll.dll
	mov	rbx,[rax+8*2]		; kernel32.dll base

	lea	rdi,[aLoadLibraryA]	; point to the ANSI string with the name of one useful kernel32.dll API
	call	aLib2addr
	jc	reconstruct_imports_exit_error

	lea	rcx,[...]		; 0-terminated string of DLL name, e.g. db 'user32.dll',0
	call	rax			; call LoadLibraryA
	or	rax,rax
	jz	reconstruct_imports_exit_error
	xchg	rbx,rax

	lea	rsi,[...]		; where to store addresses of APIs
	lea	rdi,[...]		; chain of 0-terminated strings of API names terminated with one 0-byte
					; e.g. db 'MessageBoxA',0,'SendMessageA',0,0

reconstruct_imports_L1:
; RBX = DLL base , RDI = API name (0-terminated string)
	call	aLib2addr
	jc	reconstruct_imports_exit_error
	mov	qword [rsi],rax
	lodsq				; trick only for rsi+8

; point to next 0-terminated string
	or	ecx,-1
	xor	eax,eax
	repnz scasb

; rax=0 now
	cmp	byte [rdi],al		; end of API names?
	jnz	reconstruct_imports_L1

reconstruct_imports_exit:
... do what you want...

reconstruct_imports_exit_error:
... inform user that we failed and exe will have to be terminated

aLib2addr:
; in:		rbx Dll base
;		rdi API name 0-terminated string
; out:		CF=0 success -> rax=address
;		CF=1 no such API found, rax undefined (destroyed)
; destroys:	rcx rdx
	push	rbp rsi rdi

; string size
	or	ecx,-1		; 3-byte instruction
				; makes ecx=FFFFFFFFh, enough for every string size of today...
	xor	eax,eax
	cld
	repnz scasb
	not	ecx		; FFFFFFFE->1, FFFFFFFD->2, ...
				; ecx=string size (0-terminator included)

	push	rcx

; if you want to increase safety of checking, uncomment next 8 commented lines (4 checks)
;	cmp	[rbx+IMAGE_DOS_HEADER.e_magic],IMAGE_DOS_SIGNATURE
;	jnz	aLib2addr_fail
	mov	eax,[rbx+IMAGE_DOS_HEADER.e_lfanew]	; File address of new exe header ...[rbx+3Ch] raw encoded
;	cmp	[rbx+rax*1+IMAGE_NT_HEADERS64.Signature],IMAGE_NT_SIGNATURE
;	jnz	aLib2addr_fail
;	cmp	[rbx+rax*1+IMAGE_NT_HEADERS64.FileHeader.Machine],IMAGE_FILE_MACHINE_AMD64
;	jnz	aLib2addr_fail
;	cmp	[rbx+rax*1+IMAGE_NT_HEADERS64.OptionalHeader.Magic],IMAGE_NT_OPTIONAL_HDR64_MAGIC
;	jnz	aLib2addr_fail
	mov	edx,[rbx+rax*1+IMAGE_NT_HEADERS64.OptionalHeader.DataDirectory.VirtualAddress + (sizeof.IMAGE_DATA_DIRECTORY)*IMAGE_DIRECTORY_ENTRY_EXPORT]	; export RVA ...[rbx+rax*1+88h] raw encoded
	add	rdx,rbx
; rdx = DIRECTORY_ENTRY_EXPORT

	xor	ebp,ebp			; counter
	mov	eax,[rdx+IMAGE_EXPORT_DIRECTORY.AddressOfNames]	; ...[rdx+20h] raw encoded
	add	rax,rbx
aLib2addr_L0:
	mov	esi,dword [rax+rbp*4]
	add	rsi,rbx

	pop	rcx rdi
	push	rdi rcx
	repz cmpsb
	jz	aLib2addr_L1

	inc	ebp
	cmp	ebp,[rdx+IMAGE_EXPORT_DIRECTORY.NumberOfNames]	; [rdx+18h] raw encoded
	jc	aLib2addr_L0

aLib2addr_fail:
	stc
	jmp	aLib2addr_epi

aLib2addr_L1:
	mov	ecx,[rdx+IMAGE_EXPORT_DIRECTORY.AddressOfNameOrdinals]	; [rdx+24h] raw encoded
	add	rcx,rbx
	movzx	eax,word [rcx+rbp*2]
	mov	ecx,[rdx+IMAGE_EXPORT_DIRECTORY.AddressOfFunctions]	; [rdx+1Ch] raw encoded
	add	rcx,rbx
	mov	eax,dword [rcx+rax*4]
	add	rax,rbx			; this clears carry too
aLib2addr_epi:
	pop	rcx rdi rsi rbp
	ret

aLoadLibraryA	db	'LoadLibraryA',0
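
The lookup done by aLib2addr walks three parallel tables of the export directory. The following Python sketch models only that indexing. The table contents are hypothetical sample data, not read from a real DLL, and forwarded exports (the Vista problem described above) are not handled:

```python
def resolve_export(names, name_ordinals, functions, wanted):
    """Return the RVA of `wanted`, or None when the name is not exported (CF=1 in the asm)."""
    for index, name in enumerate(names):       # AddressOfNames, scanned with rbp as counter
        if name == wanted:                     # repz cmpsb found a match
            ordinal = name_ordinals[index]     # AddressOfNameOrdinals, 16-bit entries
            return functions[ordinal]          # AddressOfFunctions, 32-bit RVAs
    return None

# Hypothetical tables for illustration only.
names = ["CloseHandle", "CreateFileA", "WriteFile"]
name_ordinals = [2, 0, 1]
functions = [0x1000, 0x2000, 0x3000]

print(hex(resolve_export(names, name_ordinals, functions, "CreateFileA")))
```

The asm version adds the DLL base (rbx) to the returned RVA to get the final address.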

***
6.*
***
Yeah, and the final and hottest stuff - compressing.
If your app is big enough, you can feel the benefit of compression (the size saved by compression must be greater than the size of the decompressor).
6A. compressing
6B. filtering (=precompression modifying)
6C. coding style for making compression more efficient

; -----

6A.
There are many algorithms for compression. We will focus on only one of them, which is easy to understand and whose decompress routine is very small (about 115 bytes).
This type of compression is based on repeated strings. Just imagine how to compress this input:
48 8D 15 5B 62 01 00 48 8D 0D 42 62 01 00
the first 7 bytes can't be compressed
the next 2 bytes 48 8D can be compressed, we just encode that we want 2 bytes from the position 7 bytes back. We must encode it in fewer than 16 bits (2 bytes) to make the compression efficient
the next 2 bytes 0D 42 can't be compressed
the next 3 bytes 62 01 00 can be compressed, we encode that we want 3 bytes from the position 7 bytes back.
How to do it efficiently? Of course we must work with single bits.
So let's jump directly into practice:
We need a routine named get_next_bit, this is a sample of it:
get_next_bit:
add dl,dl ; another possibility: shl dl,1
jnz no_new_byte
lodsb
mov dl,al ; another possibility: xchg edx,eax
adc dl,dl ; another possibility: rcl dl,1
no_new_byte:
ret

this is its usage:
lea rsi,[compressed_data]
lea rdi,[decompressed_out_buffer]
cld
mov dl,80h
B0:
call get_next_bit
jc L1
L0: ... some decompress instructions ...
jmp B0
L1: ... some decompress instructions ...
jmp B0

Note about two instructions: MOV DL,80h and ADC DL,DL.
MOV DL,80h sets up the first bit in DL, but this isn't a real control_bit used for switching the decompressor between L0 and L1. In binary, 80h = 10000000b: the highest bit (bit 7) is 1 and all other bits (bits 6 to 0) are 0. Let's call this highest bit the helper_control_bit. The helper_control_bit is never destroyed during the whole decompress process (not even at its end). It never gets lost and it is never recreated from anything other than the startup value 80h. The helper_control_bit is recycled by the instruction ADC DL,DL every time all 8 bits loaded by LODSB have been used (after 8 calls of get_next_bit). The state at the first call of get_next_bit and at every call after all control_bits have been used up (every 8th call, so call No. 1, No. 9, No. 17, ...) is the same:
DL register = 80h = 10000000b
1. ADD DL,DL   80h + 80h = 00h CarryFlag=1 ZeroFlag=1 (the helper_control_bit goes into the Carry Flag now)
2. LODSB loads a control_byte with 8 control_bits, this instruction doesn't touch the Carry Flag
3. MOV DL,AL (or XCHG EDX,EAX) moves the control_byte to the DL register, this instruction doesn't touch the Carry Flag (note that instructions PUSH,POP,MOV,XCHG,INC,LODSB,... don't change the Carry Flag)
4. ADC DL,DL recycles the helper_control_bit: it shifts all bits left by 1, puts the bit from the Carry Flag into bit 0, and the highest control_bit goes into the Carry Flag

That was the most complicated thing to explain. Now we will explain something more useful. You needn't worry if you haven't understood the above explanation, just take it as a fact that get_next_bit gets 1 bit from the input...
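
If the explanation is hard to follow, a small model may help. This is a Python sketch of get_next_bit that mimics the DL register and the Carry Flag. The class name and fields are illustrative only, and reading past the end of the input is not handled:

```python
class BitReader:
    def __init__(self, data):
        self.data = data
        self.pos = 0       # models rsi
        self.dl = 0x80     # mov dl,80h: only the helper_control_bit is set

    def get_next_bit(self):
        total = self.dl + self.dl           # add dl,dl
        carry = total >> 8                  # bit shifted out goes into the Carry Flag
        self.dl = total & 0xFF
        if self.dl == 0:                    # helper bit just left DL: fetch a new control byte
            byte = self.data[self.pos]      # lodsb / mov dl,al
            self.pos += 1
            total = byte + byte + carry     # adc dl,dl: helper bit re-enters at bit 0
            carry = total >> 8
            self.dl = total & 0xFF
        return carry

reader = BitReader(bytes([0xA5]))
print([reader.get_next_bit() for _ in range(8)])
```

The bits come out highest bit first, and after the 8th bit DL is 80h again, exactly as at startup.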

Instructions on L0 and L1 can be something like:
L0: MOVSB
JMP B0
L1: ... calculate ECX
    ... calculate EBX (delta, shift)
PUSH RSI
LEA RSI,[RDI]
SUB RSI,RBX
REPZ MOVSB
POP RSI
JMP B0
The first mode, L0, isn't a true decompress mode. The byte isn't compressed, it is only copied. This mode has a bad pack ratio, but it must be used to store the bytes which can't be produced by L1 mode. It uses 1 byte + 1 bit = 9 bits to store 1 byte = 8 bits.
The second mode, L1, is the true decompress mode. It calculates ECX, the number of bytes to copy, and EBX, a value that can be named DELTA or SHIFT (the distance back in the already decompressed output).
It has a good pack ratio, better for long chains (big ECX) and small shifts (small EBX). The methods for calculating ECX and EBX are the same. ECX and EBX are never zero (ECX<>0, EBX<>0), hence the highest set bit of the value is always 1 and needn't be transmitted.
The first instruction for calculating ECX sets up this highest bit=1 and all following bits are supplied by call get_next_bit. The first instruction is:
MOV ECX,1 ; or INC ECX if ECX=0.
Next instructions are:
CALL get_next_bit
ADC ECX,ECX ; as well RCL ECX,1 can be used
How do we terminate the calculation of ECX? Again by using call get_next_bit! Here is the full routine for calculating ECX in the decompressor:
MOV ECX,1
LCC0: CALL get_next_bit
ADC ECX,ECX
CALL get_next_bit
JC LCC0

The minimal value the above code can produce is ECX=2. ECX=1 isn't necessary because this is handled in L0 mode (MOVSB), and L0 is more efficient (even though it has a bad pack ratio) for "packing" 1 byte than L1 mode.

Example of calculating ECX=5=101b
The highest bit is supplied by MOV ECX,1 (or INC ECX), so we remove it - binary 01b
The bit sequence for calculating ECX=5 is 01 10 binary.
Calculating ECX=110100b
Remove the highest bit (this bit is supplied by MOV ECX,1 or INC ECX in the decompressor) - binary 10100b
The bit sequence for calculating ECX is 11 01 11 01 00 binary.
Calculate ECX=2=10b. Bit sequence is 0 0 binary.
Calculate ECX=3=11b. Bit sequence is 1 0 binary.
Calculate ECX=4=100b. Bit sequence is 0 1 0 0 binary.
Calculate ECX=5=101b. Bit sequence is 0 1 1 0 binary.
Calculate ECX=6=110b. Bit sequence is 1 1 0 0 binary.
Calculate ECX=7=111b. Bit sequence is 1 1 1 0 binary.
Calculate ECX=8=1000b. Bit sequence is 0 1 0 1 0 0 binary.
Calculate ECX=16=10000b. Bit sequence is 0 1 0 1 0 1 0 0 binary.
Calculate ECX=17=10001b. Bit sequence is 0 1 0 1 0 1 1 0 binary.
Calculate ECX=18=10010b. Bit sequence is 0 1 0 1 1 1 0 0 binary.
Calculate ECX=19=10011b. Bit sequence is 0 1 0 1 1 1 1 0 binary.
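
The table above can be reproduced by a short model. This Python sketch encodes and decodes the pair code (data bit, continue bit). The function names are illustrative, and the bits are kept in a plain list instead of going through get_next_bit:

```python
def encode_count(value):
    """Return the bit list that the LCC0 loop turns back into `value` (value >= 2)."""
    payload = bin(value)[3:]                 # binary digits without the leading 1
    bits = []
    for i, digit in enumerate(payload):
        bits.append(int(digit))                        # data bit for adc ecx,ecx
        bits.append(1 if i < len(payload) - 1 else 0)  # 1 = another pair follows
    return bits

def decode_count(bits):
    """Mirror of mov ecx,1 / call get_next_bit / adc ecx,ecx / call get_next_bit / jc LCC0."""
    stream = iter(bits)
    ecx = 1
    while True:
        ecx = ecx * 2 + next(stream)         # adc ecx,ecx
        if next(stream) == 0:                # jc LCC0 not taken
            return ecx

for n in (2, 3, 4, 5, 19):
    print(n, encode_count(n))
```

Every value costs 2 bits per binary digit below its highest bit, so small counts are cheap and large counts grow slowly.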

Calculating EBX has some similar and some different steps. EBX may be 1, and it can be calculated in this way:
MOV EBX,1
LCD0: CALL get_next_bit
ADC EBX,EBX
CALL get_next_bit
JC LCD0
DEC EBX

But experiments show that EBX>16 is frequent, and for EBX<16 another decompress mode can be used. Calculating EBX=15 requires 8 bits = 1 byte with the above code. A better way is to read 8 bits = 1 byte directly into BL and to calculate all bits above BL (bits 31 to 8) in a way similar to calculating ECX. Here it is:
MOV EBX,1
LCD0: CALL get_next_bit
ADC EBX,EBX
CALL get_next_bit
JC LCD0
DEC EBX
DEC EBX
SHL EBX,8
MOV BL,byte [RSI]
INC RSI
Note that DEC EBX must be used at least twice to make EBX=0 possible before SHL EBX,8 shifts all bits higher and frees BL.
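
A minimal Python model of this split, assuming exactly two DEC EBX as in the code above (the function names are illustrative):

```python
def decode_delta(prefix_value, low_byte):
    """prefix_value is the result of the LCD0 loop (>= 2), low_byte is the byte read from [rsi]."""
    ebx = prefix_value - 2        # dec ebx / dec ebx
    ebx <<= 8                     # shl ebx,8
    return ebx | low_byte         # mov bl,byte [rsi]

def encode_delta(delta):
    """Inverse operation on the compressor side: value for the pair loop and the raw low byte."""
    return (delta >> 8) + 2, delta & 0xFF

print(encode_delta(0x1234))
print(hex(decode_delta(*encode_delta(0x1234))))
```

Every delta below 100h uses the smallest possible prefix value 2, which the pair loop sends in 2 bits.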

There is a mode named without_change_delta. The main principle is to use DEC EBX three times, so that a calculated EBX=2 ends up as EBX=-1. The result EBX=-1 indicates that calculating a new delta isn't necessary and the old delta can be used. The old delta can be saved to an unused register or to the stack at the previous SUB RSI,RBX / REPZ MOVSB and restored in mode without_change_delta.

Principle of the mode for packing 2-3 bytes with delta between 1 and 7Fh:
1. Load 1 byte = 8 bits
2. bit 0 = 1 indicates packed 2 bytes
   bit 0 = 0 indicates packed 3 bytes
3. the high 7 bits (bits 7 to 1) are the delta
Here is a code example
XOR EBX,EBX ; (EBX=0)
MOV ECX,1   ; (ECX=1)
MOV BL,[RSI]
INC RSI
SHR BL,1 ; this moves bit 0 into the Carry Flag and shifts the remaining bits to make EBX=delta
SBB CL,0
INC ECX
INC ECX
It's clear that the result BL=0 after this code is an impossible delta. We may use it to TERMINATE the decompress process.
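
The same byte layout as a Python sketch (the function name is illustrative), following the SHR / SBB / INC sequence step by step:

```python
def decode_short_match(byte):
    """Split one byte into (delta, length) the way the code above does."""
    carry = byte & 1              # shr bl,1 moves bit 0 into the Carry Flag
    delta = byte >> 1             # the remaining 7 bits are the delta
    length = (1 - carry) + 2      # mov ecx,1 / sbb cl,0 / inc ecx / inc ecx
    return delta, length

print(decode_short_match(0x0F))   # delta 7, 2 bytes
print(decode_short_match(0x0E))   # delta 7, 3 bytes
```

A returned delta of 0 is the value that can serve as the end-of-data marker.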
A nice idea for packing 1 byte with delta between 1 and 15:
XOR EBX,EBX
MOV ECX,1
MOV BL,00010000b ; marker bit, it leaves BL after 4 shifts
U02: CALL get_next_bit
ADC BL,BL
JNC U02
The result EBX=0 is an impossible delta and it is used for packing the byte 00h. The byte 00h is the most frequent byte in 64-bit code. The code continues...
JNZ STORE_1_BYTE
XCHG EBX,EAX ; make EAX=0 in 1 byte instruction
JMP STORE_BYTE
...
STORE_1_BYTE:
NEG RBX
MOV AL,[RDI+RBX*1]
STORE_BYTE:
STOSB
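
The loop reads exactly 4 bits because the marker bit starts at bit 4 of BL and needs 4 shifts to reach the Carry Flag. A Python sketch of that loop (the function name is illustrative, and the bits come from a plain list instead of get_next_bit):

```python
def read_nibble(bits):
    """Model of mov bl,00010000b followed by the get_next_bit / adc bl,bl / jnc loop."""
    stream = iter(bits)
    bl = 0b00010000
    while True:
        bl = bl * 2 + next(stream)    # adc bl,bl
        if bl > 0xFF:                 # carry set: the marker bit has left the register
            return bl & 0xFF

print(read_nibble([1, 0, 1, 1]))      # delta 11
print(read_nibble([0, 0, 0, 0]))      # 0 means: store the byte 00h
```

The marker bit doubles as the loop counter, so no separate counter register is needed.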

That is all for the decompress intro. There are still parts of the decompressor not shown yet, like:
CMP EBX,7D00h
JNC WE_ADD_TWO
CMP EBX,500h
JNC WE_ADD_ONE
JMP DONT_TOUCH_IT
WE_ADD_TWO:
INC ECX
WE_ADD_ONE:
INC ECX
DONT_TOUCH_IT:
(small improvement: cmp ebx,7D00h / sbb ecx,-1 instead of cmp ebx,7D00h / jnc / inc ecx)
It's not efficient to compress 2 bytes with delta > 4FFh, because this requires 2+(3*2)+8+2 = 18 bits and the same can be done by using MOVSB mode twice (2*9=18 bits):
U00: movsb ; requires 1 byte = 8 bits
call get_next_bit ; requires 1 bit
jnc U00

It is inefficient to compress 3 bytes with delta > 7CFFh because it requires 2+(8*2)+8+(2*2) = 28 bits without, 26 bits with this implementation and can be done more efficiently by triple usage of movsb (3*8=24 bits).
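
The 18-bit figure for the 2-byte case can be recomputed with a few lines of Python. This is only a sketch under assumptions read off the sums in the text: 2 mode bits, the high part of delta transmitted as (delta >> 8) + 3 (three DEC EBX, as in without_change_delta mode), one raw byte for the low part, and the length sent through the pair loop. The function names are illustrative:

```python
def pair_bits(value):
    """Bits used by the get_next_bit pair loop to transmit `value` (value >= 2)."""
    return 2 * (value.bit_length() - 1)

def match_cost(delta, length):
    """Assumed cost in bits of one L1 match with a freshly transmitted delta."""
    return 2 + pair_bits((delta >> 8) + 3) + 8 + pair_bits(length)

def literal_cost(count):
    """Cost in bits of `count` bytes stored in MOVSB mode (8 data bits + 1 control bit each)."""
    return 9 * count

print(match_cost(0x4FF, 2), match_cost(0x500, 2), literal_cost(2))
```

Under these assumptions the cost jumps from 16 to 18 bits exactly at delta=500h, which is where a 2-byte match stops being cheaper than two literals.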

Intro for COMPRESS...
---------------------
Some equivalents:
DECOMPRESS         COMPRESS
MOV DL,80h         CALL o_c_0  ; setup helper_control_bit
CALL get_next_bit  CALL put_bit

The compress routines aren't very clever yet, the compressor simply takes the longest nearest chain. It would be good to pass the input through back-compression (compression from the end, not from the beginning), then reverse the result and analyze it from the beginning. And to try to analyze every chain, not only the longest and nearest.

6B. filtering (=precompression modifying). What does it mean?
Imagine this input code:
addr:   hexa:      asm:
401000: B82F510300 mov eax,0003512Fh
401005: E8AB6B0000 call hexa_32_eax
40100A: B86A54DE00 mov eax,00DE546Ah
40100F: E8A16B0000 call hexa_32_eax
How to optimize the dword after E8 (call)? What happens when we add the address of the instruction to the dword? 00006BABh+401005h=407BB0h 00006BA1h+40100Fh=407BB0h
so we get this input stream after filtering:
B82F510300
E8B07B4000
B86A54DE00
E8B07B4000
which is much better compressible because we can compress the last 6 bytes in pack mode with delta=10
Another optimization trick is to bswap the dword after adding the address
just imagine this input code:
401000: B82F510300 mov eax,0003512Fh
401005: E8AB6B0000 call hexa_32_eax
40100A: 48B86A54DEFF mov rax,-0021AB96h ; -0021AB96h=FFDE546Ah
401010: E8E06B0000 call hexa_64_rax
00006BABh+401005h=00407BB0h 00006BE0h+401010h=00407BF0h, then we bswap the results and get this input stream:
B82F510300
E800407BB0
48B86A54DEFF
E800407BF0
which is much better compressible because we can compress the first 4 bytes of the last 5 bytes with delta=11
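
Both variants of the E8 filter can be sketched in Python. This is a simplified model, not the code of compressor.exe: it treats every E8 byte as a call, adds the address of the E8 byte itself (as in the examples above), and the function name and parameters are illustrative:

```python
import struct

def transform_calls(code, base, inverse=False, big_endian=False):
    """Filter (or unfilter) the dword after each E8 byte; `base` is the address of code[0]."""
    out = bytearray(code)
    stored = ">I" if big_endian else "<I"     # big_endian=True models the bswap variant
    i = 0
    while i + 5 <= len(out):
        if out[i] != 0xE8:
            i += 1
            continue
        address = base + i
        if inverse:
            absolute = struct.unpack_from(stored, out, i + 1)[0]
            struct.pack_into("<I", out, i + 1, (absolute - address) & 0xFFFFFFFF)
        else:
            relative = struct.unpack_from("<I", out, i + 1)[0]
            struct.pack_into(stored, out, i + 1, (relative + address) & 0xFFFFFFFF)
        i += 5                                # skip the dword so both directions stay in step
    return bytes(out)

sample = bytes.fromhex("B82F510300E8AB6B0000B86A54DE00E8A16B0000")
print(transform_calls(sample, 0x401000).hex().upper())
```

Because the filter and the back-filter both skip the 5 bytes of every detected call, they always visit the same positions and the transformation is reversible.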

We can apply filters to various instructions, like E8 call, E9 jmp, 0F80-0F8F conditional jumps, 8D05 8D0D 8D15 8D1D 8D25 8D2D 8D35 8D3D lea gpr,[rip+...], FF15 call qword [rip+...], 88 89 8A 8B followed by byte 05 0D 15 1D 25 2D 35 3D for mov gpr,[rip+...] mov [rip+...],gpr
You can create your own filter depending on the frequency of well filterable instructions, like cmp [rip+...],gpr add [rip+...],gpr etc.
Please note that filtering may be counterproductive in some situations, just imagine this:
00000001000010B6: 48898424E8000000 mov [rsp+000000E8],rax
If we apply the filter in this situation (for the dword following byte E8), we turn the well compressible last 3 zero bytes into scrap.
But we can use a simple trick to enhance filters, e.g. if we know that our code section is smaller than 64 kB = 10000h bytes, then the upper 2 bytes of the dword may be FFFF or 0000 only (note that imports - FF15 call qword [rip+...] - could be further than 10000h away in the import section). The filter is more efficient after checking whether the upper word of the dword is FFFF or 0000 - in the above situation, the filter can be fooled only if the next instruction begins with byte 00, which is the instruction ADD. For 10000h <= code section <= 20000h we have to check the upper word of the dword for FFFE, FFFF, 0000, 0001 etc.
If we check for 2 bytes like FF15 or 0F80-0F8F, we can relax our checking of the upper word of the dword, since an instruction like 48898424FF150000 should be rarer than 48898424E8000000 (checking byte E8 is checking an 8-bit pattern; checking 0F80-0F8F is an 8+4=12-bit pattern = 16 times lower probability of filtering an inappropriate dword in comparison with the 8-bit pattern E8; checking FF15 is a 16-bit pattern = 256 times lower probability of filtering an inappropriate dword in comparison with the 8-bit pattern E8).
Note! If the precompress filter checks the upper word of the dword before filtering and refuses to filter a dword, then the back-filter after decompress must check the same condition. This may lead to mistakes and inappropriate restoring!!! The current compressor doesn't check it, IT MAY produce bad data!!! You have to check the decompressed image manually by comparing it with the image of the uncompressed executable loaded in memory.

Another kind of filter for resources:
If your resources don't contain bitmaps, icons or cursors, we can assume that the biggest part is unicode strings. In the first pass we compress bytes 0, 2, 4, ..., in the second pass we compress bytes 1, 3, 5, ...
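
The split and the restore after decompression can be sketched in Python like this (the function names are illustrative, and the sample string is just an example of UTF-16 text):

```python
def split_planes(data):
    """Return the bytes at even offsets and the bytes at odd offsets."""
    return data[0::2], data[1::2]

def merge_planes(even, odd):
    """Interleave both planes again, the inverse of split_planes."""
    out = bytearray(len(even) + len(odd))
    out[0::2] = even
    out[1::2] = odd
    return bytes(out)

text = "Open".encode("utf-16-le")
even, odd = split_planes(text)
print(even, odd)
```

For ASCII-range text the odd plane is all zeros, which is why it compresses to almost nothing.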

6C. coding style for making compression more efficient
just imagine these 2 prologues:
proc00:
push rcx rdx r8 r9	; 51 52 4150 4151
sub rsp,8*5		; 4883EC28
...

proc01:
push rcx rdx r8 r9 r15	; 51 52 4150 4151 4157
sub rsp,8*4		; 4883EC20
...

we can easily improve proc01 for better compression by rearranging the pushes:
proc01:
push r15 rcx rdx r8 r9	; 4157 51 52 4150 4151
sub rsp,8*4		; 4883EC20
Rearranging allows us to compress the whole string 51 52 4150 4151 48 83 EC in one turn. There are hundreds of situations where using your art may improve the compression ratio.
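
The gain can be measured with a naive longest-common-run search. This Python sketch is only an illustration (the function name is made up), and it uses 41 50 / 41 51 / 41 57 as the encodings of push r8 / r9 / r15:

```python
def longest_common_run(a, b):
    """Length of the longest byte string that occurs in both a and b (naive search)."""
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best

proc00 = bytes.fromhex("51 52 41 50 41 51 48 83 EC 28")
proc01_before = bytes.fromhex("51 52 41 50 41 51 41 57 48 83 EC 20")
proc01_after = bytes.fromhex("41 57 51 52 41 50 41 51 48 83 EC 20")

print(longest_common_run(proc00, proc01_before))
print(longest_common_run(proc00, proc01_after))
```

The reordered prologue shares one 9-byte run with proc00 instead of a 6-byte and a separate 3-byte run, so it needs one match instead of two.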

; --------------------------------------------------

To study the decompressor and the imports recreator, just
1. launch fdbg.fp.exe,
2. open itself as a debuggee (CTRL+E) FileName=fdbg.fp.exe
3. put a HARDWARE breakpoint (Alt-B, H) No.=0 address=00000001001387DA e/r/w=execute (at the zeros after the first call)
   (note that if you use a software breakpoint you may destroy the rest of the decompressor, because the byte 0CC overwrites a byte in the part used as the source for decompressing the rest!)
4. run (F9)
   now you can safely use software breakpoints

Some statistics from the fdbg000D compression:
the whole code+data+resource E200h+5200h+2C00h=16000h=90112 bytes
- winrar 3.50 beta4, compression method best
  result: 36253 bytes (40.23% of original size)
- compressor.exe : filtered 0B5Ch=2908 items (18.11% of 16061 instructions of the code section)
  result: 868Fh=34447 bytes (=38.23% of the original size of 90112 bytes, or 38.01% of 90624 bytes if we also count the 200h bytes of the removed, now useless imports section)
  size including 200h of header, 200h of new empty resource+compressed part of decompressor, 200h bytes of decompressor = 35840 bytes - still less than the WinRAR output !!!
  (compressor.exe without filtering: 97EAh=38890 bytes = 43.16% of original size)

code section:
winrar 57856->27057 (46.77% of original code section size)
compressor.exe with filtering 60EDh=24813 (42.89%)
(compressor.exe without filtering 7248h=29256 (50.57%))

data section:
winrar 20992->6299 (30.00%)
compressor.exe 192Ah=6442 (30.69%)

resource:
winrar 11012->2689 (24.42%)
compressor.exe C78h=3192 (28.99%) (9EDh=2541 for bytes 0., 2., 4., ...11010.  + 28Bh=651 for bytes 1., 3., ...11011.)
(compressor.exe without unicode strings filtering 3277 (29.76%))

Conclusion:
A decompression routine of 115 bytes can hardly be more efficient than WinRAR with its years of development. WinRAR defeats us in the plain compression of every section. But by using our mind (knowing the frequent instructions: RIP-relative addressing of long mode, call, jmp, conditional jmp), we are able to optimize the code section so much that in the end we (34308 bytes) defeat WinRAR (36383 bytes). Even the final executable (35840 bytes, including header+decompressor+padding_with_zeros) is smaller than the WinRAR output of 36563 bytes (35635 include headers+checksum+...)

Rambo - First blood, part II:
I've always believed that the mind's the best weapon.
