flat assembler
Message board for the users of flat assembler.
Windows > Turn procedure into Macro
MacroZ 12 Oct 2018, 20:53
I managed to put together something. I don't know if this is "sustainable"; it takes a lot of if-then statements to get it right. Is there an easier way to do it?
Code:
;##############################################################################################################
; Copy a block of memory from one address to another
; Entry:
;   CpySize = Number of bytes to copy (Imm32/Label or ebx)
;   Src     = Source address (Imm32/Label or esi)
;   Dst     = Destination address (Imm32/Label or edi)
; Used Regs:
;   ebx esi edi (If CpySize > 12, caller must save these first)
; Return:
;   None
macro _m_MemCopy CpySize*,Src*,Dst*
{
  if ~ CpySize eqtype eax
    if CpySize > 0
      if CpySize = 1
        mov al,byte [Src]
        mov byte [Dst],al
      else if CpySize = 2
        mov ax,word [Src]
        mov word [Dst],ax
      else if CpySize = 3
        mov ax,word [Src]
        mov cl,byte [Src+2]
        mov word [Dst],ax
        mov byte [Dst+2],cl
      else if CpySize = 4
        mov eax,dword [Src]
        mov dword [Dst],eax
      else if CpySize = 5
        mov eax,dword [Src]
        mov cl,byte [Src+4]
        mov dword [Dst],eax
        mov byte [Dst+4],cl
      else if CpySize = 6
        mov eax,dword [Src]
        mov cx,word [Src+4]
        mov dword [Dst],eax
        mov word [Dst+4],cx
      else if CpySize = 7
        mov eax,dword [Src]
        mov cx,word [Src+4]
        mov dl,byte [Src+6]
        mov dword [Dst],eax
        mov word [Dst+4],cx
        mov byte [Dst+6],dl
      else if CpySize = 8
        mov eax,dword [Src]
        mov ecx,dword [Src+4]
        mov dword [Dst],eax
        mov dword [Dst+4],ecx
      else if CpySize = 9
        mov eax,dword [Src]
        mov ecx,dword [Src+4]
        mov dl,byte [Src+8]
        mov dword [Dst],eax
        mov dword [Dst+4],ecx
        mov byte [Dst+8],dl
      else if CpySize = 10
        mov eax,dword [Src]
        mov ecx,dword [Src+4]
        mov dx,word [Src+8]
        mov dword [Dst],eax
        mov dword [Dst+4],ecx
        mov word [Dst+8],dx
      else if CpySize = 11
        mov eax,dword [Src]
        mov ecx,dword [Src+4]
        mov dx,word [Src+8]
        mov dword [Dst],eax
        mov dword [Dst+4],ecx
        mov word [Dst+8],dx
        mov al,byte [Src+10]
        mov byte [Dst+10],al
      else if CpySize = 12
        mov eax,dword [Src]
        mov ecx,dword [Src+4]
        mov edx,dword [Src+8]
        mov dword [Dst],eax
        mov dword [Dst+4],ecx
        mov dword [Dst+8],edx
      else
        cld
        if ~ Src eqtype eax
          mov esi,Src
        end if
        if ~ Dst eqtype eax
          mov edi,Dst
        end if
        if CpySize mod 4 = 0
          mov ecx,CpySize shr 2
          rep movsd
        else
          mov ecx,CpySize shr 2
          rep movsd
          mov ecx,CpySize and 3
          rep movsb
        end if
      end if
    end if
  else
    cld
    mov ecx,CpySize
    if ~ Src eqtype eax
      mov esi,Src
    end if
    if ~ Dst eqtype eax
      mov edi,Dst
    end if
    shr ecx,2
    rep movsd
    mov ecx,CpySize
    and ecx,3
    rep movsb
  end if
}
;##############################################################################################################
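For reference, here is a minimal usage sketch; the buffer labels are invented purely for illustration:
Code:
; hypothetical buffers, only to show the two ways the macro can be invoked
SrcBuf  rb 64
DstBuf  rb 64

        ; constant size up to 12: expands to plain unrolled moves
        _m_MemCopy 12,SrcBuf,DstBuf

        ; size only known at run time in ebx: expands to the rep movsd/movsb path
        mov ebx,64
        mov esi,SrcBuf
        mov edi,DstBuf
        _m_MemCopy ebx,esi,edi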
Furs 13 Oct 2018, 12:36
Looks good to me. You could of course "share" some of the code, since some cases are similar, but if you find it easier this way, just keep it.
Your MemCopy macro is not very efficient because of the trailing rep movsb, since those instructions have startup overhead and only pay off when copying large blocks. If you target newer CPUs with fast "rep movs" (I assume you do, since you use it in the first place), then just use rep movsb for the entire thing; the CPU is smart enough to do a large copy in the most optimal way (note that it should be done for larger blocks only). The beauty of proper CISC. If you also target older CPUs, then unfortunately "rep movs" is not a fast way to copy memory.
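As a rough sketch, the run-time-size path could then collapse to something like this (same register assignments as the macro above uses):
Code:
        ; assumes size in ecx, source in esi, destination in edi
        cld
        rep movsb       ; on CPUs with fast rep movs the microcode picks the widest copy itself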
MacroZ 13 Oct 2018, 15:50
I haven't tested rep movsb alone; I will test it. But on 64-bit and a fairly new computer, rep movs should probably be avoided altogether. In my experience, the rep stos instructions are superior to regular instructions, but the rep movs instructions are very slow and should be avoided. I tried creating a 64-bit memcopy routine some years back using rep movs, and it was not good compared to regular instructions. Here is the 64-bit memcopy using regular instructions (both macro and procedure):
Code:
;##############################################################################################################
; Copy a block of memory from one address to another
; Entry:
;   CpySize = Number of bytes to copy (Imm64/Label or rcx)
;   Src     = Source address (Imm64/Label or rdx)
;   Dst     = Destination address (Imm64/Label or r8)
;   bRbx    = Set to TRUE to allow the use of rbx register or FALSE if not
;   bRsi    = Set to TRUE to allow the use of rsi register or FALSE if not
;   bRdi    = Set to TRUE to allow the use of rdi register or FALSE if not
;   bR12    = Set to TRUE to allow the use of r12 register or FALSE if not
;   bR13    = Set to TRUE to allow the use of r13 register or FALSE if not
;   bR14    = Set to TRUE to allow the use of r14 register or FALSE if not
;   bR15    = Set to TRUE to allow the use of r15 register or FALSE if not
; Used Regs:
;   rbx rsi rdi and r12-r15 (If caller set them to be used in the arguments)
; Return:
;   None
macro _m_MemCopy CpySize*,Src*,Dst*,bRbx*,bRsi*,bRdi*,bR12*,bR13*,bR14*,bR15*
{
  local loop32,check8,check4,check1,loop1,bye
  maxsize = 39+(bRbx*8)+(bRsi*8)+(bRdi*8)+(bR12*8)+(bR13*8)+(bR14*8)+(bR15*8)
  if Src eqtype rax
    maxsize = maxsize - 8
  end if
  if Dst eqtype rax
    maxsize = maxsize - 8
  end if
  if ~ CpySize eqtype rax & CpySize > 0 & CpySize <= maxsize
    qcount = 0
    current_offset = 0
    if CpySize/8 > qcount
      qcount = qcount + 1
      mov rax,qword [Src]
      current_offset = current_offset + 8
    end if
    if bRbx = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov rbx,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if bRsi = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov rsi,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if bRdi = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov rdi,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if bR12 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov r12,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if bR13 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov r13,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if bR14 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov r14,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if bR15 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov r15,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if CpySize/8 > qcount
      qcount = qcount + 1
      mov rcx,qword [Src+current_offset]
      current_offset = current_offset + 8
    end if
    if ~ Src eqtype rax
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov rdx,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if ~ Dst eqtype rax
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov r8,qword [Src+current_offset]
        current_offset = current_offset + 8
      end if
    end if
    if CpySize and 4 = 4
      mov r9d,dword [Src+current_offset]
      current_offset = current_offset + 4
    else if CpySize/8 > qcount
      qcount = qcount + 1
      mov r9,qword [Src+current_offset]
      current_offset = current_offset + 8
    end if
    if CpySize and 2 = 2
      mov r10w,word [Src+current_offset]
      current_offset = current_offset + 2
    else if CpySize/8 > qcount
      qcount = qcount + 1
      mov r10,qword [Src+current_offset]
      current_offset = current_offset + 8
    end if
    if CpySize and 1 = 1
      mov r11b,byte [Src+current_offset]
      current_offset = current_offset + 1
    else if CpySize/8 > qcount
      qcount = qcount + 1
      mov r11,qword [Src+current_offset]
      current_offset = current_offset + 8
    end if
    current_offset = 0
    qcount = 0
    if CpySize/8 > qcount
      qcount = qcount + 1
      mov qword [Dst],rax
      current_offset = current_offset + 8
    end if
    if bRbx = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],rbx
        current_offset = current_offset + 8
      end if
    end if
    if bRsi = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],rsi
        current_offset = current_offset + 8
      end if
    end if
    if bRdi = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],rdi
        current_offset = current_offset + 8
      end if
    end if
    if bR12 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],r12
        current_offset = current_offset + 8
      end if
    end if
    if bR13 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],r13
        current_offset = current_offset + 8
      end if
    end if
    if bR14 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],r14
        current_offset = current_offset + 8
      end if
    end if
    if bR15 = 1
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],r15
        current_offset = current_offset + 8
      end if
    end if
    if CpySize/8 > qcount
      qcount = qcount + 1
      mov qword [Dst+current_offset],rcx
      current_offset = current_offset + 8
    end if
    if ~ Src eqtype rax
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],rdx
        current_offset = current_offset + 8
      end if
    end if
    if ~ Dst eqtype rax
      if CpySize/8 > qcount
        qcount = qcount + 1
        mov qword [Dst+current_offset],r8
        current_offset = current_offset + 8
      end if
    end if
    if CpySize and 4 = 4
      mov dword [Dst+current_offset],r9d
      current_offset = current_offset + 4
    else if CpySize/8 > qcount
      qcount = qcount + 1
      mov qword [Dst+current_offset],r9
      current_offset = current_offset + 8
    end if
    if CpySize and 2 = 2
      mov word [Dst+current_offset],r10w
      current_offset = current_offset + 2
    else if CpySize/8 > qcount
      qcount = qcount + 1
      mov qword [Dst+current_offset],r10
      current_offset = current_offset + 8
    end if
    if CpySize and 1 = 1
      mov byte [Dst+current_offset],r11b
      current_offset = current_offset + 1
    else if CpySize/8 > qcount
      qcount = qcount + 1
      mov qword [Dst+current_offset],r11
      current_offset = current_offset + 8
    end if
  else if (~ CpySize eqtype rax & CpySize > 0) | (CpySize eqtype rax)
    if ~ CpySize eqtype rax
      mov rcx,CpySize
      mov r9,CpySize
      if ~ Src eqtype rax
        mov rdx,Src
      end if
      if ~ Dst eqtype rax
        mov r8,Dst
      end if
    else
      mov r9,rcx
      if ~ Src eqtype rax
        mov rdx,Src
      end if
      if ~ Dst eqtype rax
        mov r8,Dst
      end if
    end if
    shr rcx,4
    mov r11d,16
    jz check8
    align 8
  loop32:
    mov rax,[rdx]
    mov r10,[rdx+8]
    lea rdx,[rdx+r11]
    mov [r8],rax
    mov [r8+8],r10
    add r8,r11
    sub rcx,1
    jz check8
    mov rax,[rdx]
    mov r10,[rdx+8]
    lea rdx,[rdx+r11]
    mov [r8],rax
    mov [r8+8],r10
    add r8,r11
    sub rcx,1
    jnz loop32
  check8:
    test r9d,8
    jz check4
    mov rax,[rdx]
    lea rdx,[rdx+8]
    mov [r8],rax
    add r8,8
  check4:
    test r9d,4
    jz check1
    mov eax,[rdx]
    lea rdx,[rdx+4]
    mov [r8],eax
    add r8,4
  check1:
    and r9,3
    jz bye
    align 4
  loop1:
    mov al,[rdx]
    lea rdx,[rdx+1]
    mov [r8],al
    add r8,1
    sub r9,1
    jnz loop1
  bye:
  end if
}
;##############################################################################################################

Code:
;##############################################################################################################
; Copy a block of memory from one address to another
; Entry:
;   rcx = Number of bytes to copy
;   rdx = Source address
;   r8  = Destination address
; Return:
;   None
proc MemCopy,CpySize,pSrc,pDst
    mov r9,rcx
    shr rcx,4
    mov r11d,16
    jz .check8
    align 8
  .loop32:
    mov rax,[rdx]
    mov r10,[rdx+8]
    lea rdx,[rdx+r11]
    mov [r8],rax
    mov [r8+8],r10
    add r8,r11
    sub rcx,1
    jz .check8
    mov rax,[rdx]
    mov r10,[rdx+8]
    lea rdx,[rdx+r11]
    mov [r8],rax
    mov [r8+8],r10
    add r8,r11
    sub rcx,1
    jnz .loop32
  .check8:
    test r9d,8
    jz .check4
    mov rax,[rdx]
    lea rdx,[rdx+8]
    mov [r8],rax
    add r8,8
  .check4:
    test r9d,4
    jz .check1
    mov eax,[rdx]
    lea rdx,[rdx+4]
    mov [r8],eax
    add r8,4
  .check1:
    and r9,3
    jz .ret
    align 4
  .loop1:
    mov al,[rdx]
    lea rdx,[rdx+1]
    mov [r8],al
    add r8,1
    sub r9,1
    jnz .loop1
  .ret:
    ret
endp
;##############################################################################################################
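For reference, a minimal usage sketch of the macro; the buffer labels are invented for illustration, and TRUE/FALSE are assumed to expand to 1/0:
Code:
; hypothetical buffers
SrcBuf  rb 128
DstBuf  rb 128

        ; small constant size with all optional registers allowed: unrolled register moves
        _m_MemCopy 40,SrcBuf,DstBuf,1,1,1,1,1,1,1

        ; size only known at run time in rcx: takes the loop32/check8/check4/loop1 path
        mov rcx,128
        mov rdx,SrcBuf
        mov r8,DstBuf
        _m_MemCopy rcx,rdx,r8,0,0,0,0,0,0,0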
MacroZ 13 Oct 2018, 22:31
I would like some input on the latest macro: is it good? Is there anything in the macro variant that should be designed differently? Are there any places where match would be better to use, perhaps with sub-macros inside the main macro?
If anyone is still alive on the forum, that is (it doesn't seem like people are very active here).
Furs 14 Oct 2018, 14:18
MacroZ wrote: I haven't tested rep movsb alone; I will test it. But on 64-bit and a fairly new computer, rep movs should probably be avoided altogether.

I found this with a lot more info if you want: https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy

Your macro seems pretty large, but if it works then it's fine. match is used when you need to do symbol comparisons. Note that FASM has two stages. The preprocessor deals with text (symbols), and match is part of it. The second stage is the assembly stage. if statements are part of the assembly stage, so they can mostly refer only to numbers (with a few exceptions, such as registers and the like). All the symbols/values defined in the assembly stage (with the = operator) can only hold such numbers; they can't hold arbitrary text/symbols. You need the equ and define preprocessor directives for that.

match is useful for macros with "custom syntax", instead of just passing parameters normally. For example, you can extract a parameter that looks like "rdi:true" into "rdi" and "true" and do other sorts of text processing (both of those are text). But if your parameters look like "true, true, true, false, true", you don't need it, assuming of course that true expands during preprocessing to some number (1?). Remember: variables during the assembly stage don't contain arbitrary text; all of it has been replaced by the preprocessor.

You could actually not define "true" and "false" at all and use match to check a parameter against "=true" (literal text), if you want, instead of using if. Then it is resolved at preprocessing time.
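For example, a minimal sketch of both uses of match; the macro name and the displayed text are made up:
Code:
macro set_flag param
{
        ; custom "reg:flag" syntax: split the parameter into two pieces of text
        match r:f, param \{ display `r, ':', `f, 13, 10 \}

        ; literal-text comparison, resolved at preprocessing time
        match =true, param \{ display 'flag is true', 13, 10 \}
}

set_flag rdi:true       ; first match fires and displays rdi:true
set_flag true           ; second match fires and displays flag is true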
MacroZ 14 Oct 2018, 17:32
Care to show me how you would do the macro prototype?
donn 14 Oct 2018, 18:51
AMD also has some rep movs alternatives in the Software Optimization Guide for AMD64 Processors (publication 25112, Rev. 3.06, September 2005), Section 5.13. It's a bit old, so it's possible their method has been superseded. They have examples which I found interesting; I implemented part of one and tested it myself. The alignment part isn't implemented yet, but the copying seemed to work:
Code:
        mov [linearCopy.copyAddress], rcx
        mov [linearCopy.copyDestAddress], rdx
        mov [linearCopy.copySize], r8
.copySet:
        mov rsi, [linearCopy.copyAddress]
        mov rdi, [linearCopy.copyDestAddress]
        cld
        mov rax, [linearCopy.copySize]
        mov [linearCopy.copySizeRemainder], rax
        shr rax, 101b                   ; Divide by 32
        mov [linearCopy.copySize], rax
        mov rax, [linearCopy.copySize]
        mov rdx, 0
        mov rcx, 100000b
        imul rcx
        mov r10, [linearCopy.copySizeRemainder]
        sub r10, rax
        mov [linearCopy.copySizeRemainder], r10
        mov rax, [linearCopy.copySize]
        cmp rax, 0
        je linearCopy.smallCopyOnly
        ;and rsp, -32 ;align 16        ; Not working yet
.copyLarge:
        ; Copy in chunks of 4 qwords. AMD Optimization recommendation. Compare with rep movsq.
        mov r8, [rsi]
        mov r9, [rsi+1000b]
        add rsi, 100000b
        movnti [rdi], r8
        movnti [rdi+1000b], r9
        add rdi, 100000b
        mov r8, [rsi-10000b]
        mov r9, [rsi-1000b]
        dec rax
        movnti [rdi-10000b], r8
        movnti [rdi-1000b], r9
        jnz linearCopy.copyLarge
.smallCopyOnly:
        mov rcx, [linearCopy.copySizeRemainder]
        rep movsb
        mov rax, [linearCopy.copyDestAddress]
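For context, here is a guess at the storage the fragment uses; the linearCopy.* names come from the code above, but their definitions aren't shown, so this layout is only an assumption:
Code:
; hypothetical definitions of the scratch variables referenced above
linearCopy.copyAddress          dq 0
linearCopy.copyDestAddress      dq 0
linearCopy.copySize             dq 0
linearCopy.copySizeRemainder    dq 0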
MacroZ 14 Oct 2018, 20:24
Even if it is old, people can still make use of it; it doesn't mean things stop there. I was thinking more about the macro implementation itself, but thanks anyway, it's a nice example to keep in mind.
Furs 14 Oct 2018, 21:49
MacroZ wrote: Care to show me how you would do the macro prototype?

In this case I'd rather just specify the registers as a single parameter, each separated by a space. It's the cleanest way to call this macro IMO. Something like this:

Code:
@err fix macro +

macro _m_MemCopy CpySize*,Src*,Dst*,regs*
{
  local loop32,check8,check4,check1,loop1,bye

  irp reg, bRbx,bRsi,bRdi,bR12,bR13,bR14,bR15
  \{
    local reg
    reg = 0
  \}

  define reg
  irps r, regs
  \{
    match =rbx, r \\{ define reg bRbx \\}
    match =rsi, r \\{ define reg bRsi \\}
    match =rdi, r \\{ define reg bRdi \\}
    match =r12, r \\{ define reg bR12 \\}
    match =r13, r \\{ define reg bR13 \\}
    match =r14, r \\{ define reg bR14 \\}
    match =r15, r \\{ define reg bR15 \\}
    match , reg \\{ @err "Bad register" \\}
    reg = 1
    restore reg
  \}
  restore reg

  ; more stuff
}

Code:
        _m_MemCopy 1, 2, 3, rbx r13 r15

Just FYI, after preprocessing, this will look like:

Code:
; the locals here would be replaced by some local auto-generated names due to our use of
; local without the \ (so it's part of the macro, not the irp)
bRbx = 0
bRsi = 0
bRdi = 0
bR12 = 0
bR13 = 0
bR14 = 0
bR15 = 0

bRbx = 1
bR13 = 1
bR15 = 1

It's only slightly important because the assembly stage is multi-pass, so this will get "evaluated" multiple times for each pass if needed.
MacroZ 14 Oct 2018, 22:49
Something like that.
I will try it out as soon as I get back to coding. I hate it when I can't get macros as clean as that.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.