flat assembler
Message board for the users of flat assembler.
Modern x86 instructions
DimonSoft 23 Jul 2019, 20:53
Start sizecoding. When you try to decrease the size you end up looking at the instruction set wondering which instruction might help you save a few more bytes thanks to its purpose or side effects.
P.S. String-handling instructions have been there since the 8086, so they're not modern instructions.
rugxulo 28 Jul 2019, 02:20
Wikipedia wrote:
rugxulo 29 Jul 2019, 13:02
fpissarra 29 Jul 2019, 16:18
I suggest you take a look at what the compiler does. For instance:
Code:
    char *strcpy( char *dp, char *sp )
    {
      char *p = dp;
      while ( *dp++ = *sp++ );
      return p;
    }

Compiling with GCC and creating the assembly listing, with good optimization for your processor:

Code:
    $ gcc -O2 -mtune=native -masm=intel -S strcpy.c

You'll end up with something like this:

Code:
    strcpy:
      mov rax,rdi               ; return value = destination
      mov rdx,rdi               ; rdx = write pointer
    .loop:
      add rsi,1
      movzx ecx, byte [rsi-1]   ; load next source byte
      add rdx,1
      mov byte [rdx - 1],cl     ; store it to the destination
      test cl,cl                ; stop after the terminating zero
      jne .loop
      ret

Sure, it is longer code than using REP MOVSB or a LODSB/STOSB pair, but your compiler thinks this is the best code possible for YOUR processor (probably). Some instructions are slower nowadays (like XLAT, INC and DEC - hence the ADD above). It won't use THEM.
DimonSoft 29 Jul 2019, 18:21
fpissarra wrote:
    Sure, it is longer code than using REP MOVSB or a LODSB/STOSB pair, but your compiler thinks this is the best code possible for YOUR processor (probably).

The obligatory note from revolution that performance should be measured belongs here.

Anyway, I still insist REP MOVSB is more future-proof performance-wise: it expresses a higher-level concept, and newer processors can use that information for special optimizations that are hardly applicable to code that says how to do something instead of saying what to do. Besides, AFAIK, MOVSB used to be faster, then was slower for some generations of processors, but is faster again in modern CPUs. It should also be noted that a small two-byte REP MOVSB has a much better chance of keeping the prefetcher happy and is definitely more cache-friendly.
revolution 29 Jul 2019, 18:51
Thanks to DimonSoft for pointing out that code performance is system specific. Some systems execute code faster and others slower. And it isn't just which instructions you use; it is also where they appear in the code stream, what state the caches are in, the type of memory bus used, and so on.
Basically, there are plenty of things that go into the final execution speed. Trying to guess which code works best, for which CPU, for which memory, for which cache state etc., is really, really hard, if not impossible. IMO there is nothing wrong with rep movsb, or with using the basic RISC-type instructions either. Use whatever works best for your systems. Or just use whatever you understand the most. In many cases it won't matter which you use anyway. So don't worry about it too much unless you encounter some bottleneck that you need to improve.
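A minimal sketch of what "measure it" could look like in C, for anyone who wants to compare copy routines on their own machine. The names `copy_fn` and `time_copy` are hypothetical and not from this thread, and `clock()` only gives coarse CPU time, so treat the numbers as a rough comparison under the cache and system state you happened to run in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strcpy, used in the usage note below */
#include <time.h>

/* Hypothetical signature shared by the copy routines under test. */
typedef char *(*copy_fn)(char *dst, const char *src);

/* Call fn(dst, src) `reps` times and return the elapsed CPU time in seconds. */
double time_copy(copy_fn fn, char *dst, const char *src, size_t reps)
{
    assert(fn != NULL && dst != NULL && src != NULL);

    volatile char sink = 0;            /* keeps the copies observable */
    clock_t start = clock();
    for (size_t i = 0; i < reps; i++) {
        fn(dst, src);
        sink = dst[0];
    }
    clock_t end = clock();
    (void)sink;

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For example, `time_copy(strcpy, buf, text, 1000000)` can be compared against the same call with a hand-written routine, ideally for several string lengths, since short and long strings may favour different code.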
fpissarra 29 Jul 2019, 23:12
I am not implying that the code created by the C compiler is faster than handcrafted code. As I wrote, it PROBABLY is (considering most wannabe programmers don't do the proper research to create performant code).

Yep... I agree REP MOVSB can be faster, but in that particular example the source string size must be known, which requires a REPNZ SCASB and some arithmetic with (R|E)CX, and so on... And I agree that performance must be measured, since not only the latency and throughput of instructions are in play, but also cache effects, page faults, invalid TLB entries, etc... By my comparisons (measuring), this:

Code:
    strcpy_:
      mov rdx,rdi       ; save destination
      mov rdi,rsi       ; scan the source string
      mov rcx,-1
      xor al,al
      repne scasb       ; find the terminating zero
      neg rcx
      sub rcx,1         ; rcx = length + 1 (includes the zero)
      mov rdi,rdx       ; restore destination
      rep movsb
      mov rax,rdx
      ret

is slower than the previous routine... I know REP MOVSB can take advantage of copying entire cache lines on Haswell+ processors, but REP SCASB has no such advantage...

My comments are, in a sense, for @int0x50 NOT to worry too much about lots of "instructions", since good C compilers (which do a good, but not AMAZING, job) do that, too...
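The trade-off between the two routines can be sketched in C. This is only an illustration, not the code that was measured above: `copy_one_pass` and `copy_two_pass` are hypothetical names. The second one mirrors the REPNE SCASB + REP MOVSB approach by first finding the length (here with `strlen`) and then doing a block copy (here with `memcpy`), so the source string is traversed twice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One pass: copy byte by byte until the terminating zero has been stored. */
char *copy_one_pass(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    char *p = dst;
    while ((*p++ = *src++) != '\0')
        ;
    return dst;
}

/* Two passes: find the length first, then block-copy length + 1 bytes. */
char *copy_two_pass(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);      /* plays the role of REPNE SCASB */
    memcpy(dst, src, len + 1);     /* plays the role of REP MOVSB   */
    return dst;
}
```

Whether the block copy wins back the cost of the extra scan depends on the string length and the machine, which is the point made above about measuring.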
bitRAKE 30 Jul 2019, 02:14
The compiler explorer is a fun tool to play with ...
https://gcc.godbolt.org/z/nCam1B

Code:
    strcpy(char*, char*):                   # @strcpy(char*, char*)
      mov rax, rdi
      xor ecx, ecx
    .LBB0_1:                                # =>This Inner Loop Header: Depth=1
      movzx edx, byte ptr [rsi + rcx]
      mov byte ptr [rax + rcx], dl
      add rcx, 1
      test dl, dl
      jne .LBB0_1
      ret

It is wholly possible that the course of your work only needs a small subset of the instruction set. DimonSoft's recommendation of size coding is a good one - attempting to utilize every aspect of an instruction creates added familiarity, imho. Reading code produced by others is also educational. Maybe a combination of the two: reading the size coding of others.
_________________
¯\(°_o)/¯ “languages are not safe - uses can be” Bjarne Stroustrup
Roman 30 Jul 2019, 08:51
Nice code bitRAKE.
I want to show this code, too.

Code:
    ; in: rax = destination string
    ; in: rsi = source string
    strcopy:
      sub rcx, rcx                ; rcx = 0 (byte index)
    .LBB0_1:                      ; inner loop
      mov dl, byte [rsi + rcx]
      mov byte [rax + rcx], dl
      inc rcx
      test dl, dl
      jne .LBB0_1
      ; now rcx holds the number of bytes copied
      ; (the string size including the terminating zero)
      ret

I never understood why I always read about how well C++ compilers optimize code. In the real world, compiler optimization is neither that good nor perfect!
bitRAKE 30 Jul 2019, 10:32
I just pasted from the compiler output. My version would go something like ...
Code:
    macro strzcpy
      @@:
        lodsb
        stosb
        test al,al
        jnz @B
    end macro

Last edited by bitRAKE on 30 Jul 2019, 10:39; edited 1 time in total
fpissarra 30 Jul 2019, 10:38
@bitRAKE, this is a nice macro, indeed!
|
Roman 30 Jul 2019, 11:22
What about CLD or STD for LODSB?
Last edited by Roman on 30 Jul 2019, 14:11; edited 1 time in total |
bitRAKE 30 Jul 2019, 11:29
Most of the environments I work in default to CLD, but that is a good point.
I have always recommended the Hugi Size Coding Competition Series for those wanting to challenge themselves with size.
st 31 Jul 2019, 09:02
There is one nuance about MOVS that one could miss: it moves data atomically. The destination data is always equal to the source data.

Let me explain this with an example of an automatic data allocator (a bit of a long read; one may skip #1).

1. There are a few preallocated bytes at the end of an executable image:

Code:
    lea RDI, [heap]
    stosq
    ;
    stosq
    ;
    heap rb INITIAL_SIZE

Just put data at the address kept in RDI, then increment RDI. Got an exception while writing to the uncommitted address after this heap? Just catch it and map another RAM page in the exception handler.

This is simple, but of course it tends to waste all available RAM. Therefore it needs some garbage collection. So, every block that is in use has a pointer to it. These pointers (so-called roots) are stored somewhere. Now the exception handler scans the pointers, finds the live blocks and moves them to the start of the heap (a kind of compacting GC). Then it corrects RDI, AND corrects the pointers to the moved blocks.

2. Suppose some block containing volatile data (roots) is being copied to the current allocation space:

Code:
    movsq
    movsq ; Out of heap exception (#PF) is generated.
          ; Source data is then changed by the exception handler.
          ; Then handler returns and failed instruction is re-executed.
          ; So modified source is moved to the destination.

I hope you get some ideas. However, please consider this as an assumption. Refer to SDM, Vol.1, Section 7.3.9.3 “Fast-String Operation” and Vol.3A, Section 8.2.4 “Fast-String Operation and Out-of-Order Stores” for additional information.
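The allocation scheme in #1 can be sketched in C as a bump allocator. This is a simplified illustration under stated assumptions, not the poster's implementation: `BumpHeap`, `bump_store` and `HEAP_WORDS` are hypothetical names, the heap has a fixed size, and an explicit bounds check stands in for the page fault that the original relies on (where the handler would map another page or compact the heap).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { HEAP_WORDS = 4 };            /* hypothetical capacity in 64-bit words */

typedef struct {
    uint64_t words[HEAP_WORDS];     /* stands in for "heap rb INITIAL_SIZE" */
    size_t   next;                  /* next free word; plays the role of RDI */
} BumpHeap;

/* Store one 64-bit value and advance the allocation position, like STOSQ.
   Returns a pointer to the stored word, or NULL when the heap is full
   (the point where the original scheme would take an exception instead). */
uint64_t *bump_store(BumpHeap *h, uint64_t value)
{
    assert(h != NULL);
    if (h->next >= HEAP_WORDS)
        return NULL;
    h->words[h->next] = value;
    return &h->words[h->next++];
}
```

Allocation is a single store plus an increment, which is why the scheme is attractive; the cost is that reclaiming space needs a separate step, here left out entirely.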
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.