flat assembler
Message board for the users of flat assembler.

Modern x86 instructions

int0x50



Joined: 19 Jul 2019
Posts: 13
I come from a C programming background and have started learning assembly language programming. I feel like I am using only 16-bit instructions and techniques, and failing to exploit instructions like LODSB, REPNE, CMPSB, etc. So even for simple logic like strcmp, I end up writing a lot of instructions. What is the best way to overcome this problem?
Post 23 Jul 2019, 15:17
DimonSoft



Joined: 03 Mar 2010
Posts: 572
Location: Belarus
Start sizecoding. When you try to decrease the size, you end up looking at the instruction set, wondering which instruction might help you save a few more bytes thanks to its purpose or side effects.
P.S. String-handling instructions have been there since the 8086, so they’re not modern instructions.
Post 23 Jul 2019, 20:53
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
Wikipedia wrote:

SSE 4.2 added STTNI (String and Text New Instructions), several new instructions that perform character searches and comparison on two operands of 16 bytes at a time. These were designed (among other things) to speed up the parsing of XML documents.
Post 28 Jul 2019, 02:20
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
It seems that SSE 4.2 has already been mentioned a few times here on this forum, e.g. here (Main, "explain how work pcmpestri", circa 2013). It refers to this third-party webpage (11 years ago, Peter Kankowski), which is very informative and also uses FASM!
Post 29 Jul 2019, 13:02
fpissarra



Joined: 10 Apr 2019
Posts: 63
I suggest you take a look at what the compiler does. For instance:
Code:
char *strcpy( char *dp, char *sp )
{
  char *p = dp;
  while ( *dp++ = *sp++ );
  return p;
}    

Compiling with GCC and creating the assembly listing, with good optimization for your processor:
Code:
$ gcc -O2 -mtune=native -masm=intel -S strcpy.c    

You'll end up with something like this:
Code:
strcpy:
  mov rax,rdi
  mov rdx,rsi
.loop:
  add rsi,1
  movzx ecx, byte [rsi-1]
  add rdx,1
  mov byte [rdx - 1],cl
  test cl,cl
  jne .loop
  ret    

Sure, this is longer code than using REP MOVSB or a LODSB/STOSB pair, but your compiler thinks this is the best code possible for YOUR processor (probably).

Some instructions are slower nowadays (like XLAT, INC and DEC, hence the ADD above). The compiler won't use THEM.
Post 29 Jul 2019, 16:18
DimonSoft



Joined: 03 Mar 2010
Posts: 572
Location: Belarus
fpissarra wrote:
Sure, this is longer code than using REP MOVSB or a LODSB/STOSB pair, but your compiler thinks this is the best code possible for YOUR processor (probably).

Some instructions are slower nowadays (like XLAT, INC and DEC, hence the ADD above). The compiler won't use THEM.

The obligatory note from revolution that performance should be measured belongs here.

Anyway, I still insist that REP MOVSB is more future-proof performance-wise, as it expresses a higher-level concept, and newer processors can use that information for special optimizations that are hardly applicable to code that says how to do something instead of what to do.

Besides, AFAIK, MOVSB used to be faster, then slower for some generations of processors, but is faster again on modern CPUs. It should also be noted that a small two-byte REP MOVSB has a much better chance of making the prefetcher happy and is definitely more cache-friendly.
Post 29 Jul 2019, 18:21
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16778
Location: In your JS exploiting you and your system
Thanks to DimonSoft for pointing out that code performance is system specific. Some systems execute code faster and others slower. And it isn't just which instructions you use; it is also where they appear in the code stream, what state the caches are in, the type of memory bus used, and so on.

Basically there are plenty of things that go into the final execution speed. Trying to guess which code works best, for which CPU, for which memory, for which cache state etc., is really really hard, if not impossible.

IMO there is nothing wrong with rep movsb, or using the basic RISC type instructions either. Use whatever works best for your systems. Or just use whatever you understand the most. In many cases it won't matter which you use anyway. So don't worry about it too much unless you encounter some bottleneck that you need to improve.
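
To make the "measure it" advice concrete, here is a minimal C sketch of a timing harness. It is not from anyone's post: the names copy_bytewise, copy_twopass and time_copy are made up for illustration, and clock() gives only coarse CPU time, so treat the numbers as a rough hint rather than a verdict.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* One-pass copy: the same shape as the byte loop shown in this thread. */
char *copy_bytewise(char *dst, const char *src)
{
    char *p = dst;
    assert(dst != NULL && src != NULL);
    while ((*p++ = *src++) != '\0')
        ;                               /* copy up to and including the zero */
    return dst;
}

/* Two-pass copy: find the length first, then do a single block copy. */
char *copy_twopass(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, strlen(src) + 1);  /* +1 copies the terminator too */
    return dst;
}

typedef char *(*copy_fn)(char *, const char *);

/* Returns the CPU time in seconds spent on `reps` calls of `fn`. */
double time_copy(copy_fn fn, char *dst, const char *src, long reps)
{
    volatile char sink = 0;  /* discourages the optimizer from dropping calls */
    clock_t start, stop;
    long i;

    start = clock();
    for (i = 0; i < reps; i++) {
        fn(dst, src);
        sink ^= dst[0];
    }
    stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Call time_copy once per candidate with the same buffers and repetition count, and repeat the whole run a few times, with string lengths that match your real workload, before drawing conclusions.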
Post 29 Jul 2019, 18:51
fpissarra



Joined: 10 Apr 2019
Posts: 63
I am not implying that compiler-generated code is faster than handcrafted code. As I wrote, it PROBABLY is (considering that most wannabe programmers don't do the research needed to write performant code).

Yep... I agree REP MOVSB can be faster, but in that particular example the source string size must be known first, which requires a REPNZ SCASB and some arithmetic with (R|E)CX, and so on... And I agree that performance must be measured, since not only the latency and throughput of instructions are in play, but also cache effects, page faults, invalid TLB entries, etc...

In my measurements, this:
Code:
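strcpy_:
  mov   rdx,rdi     ; save the destination pointer
  mov   rdi,rsi     ; SCASB scans through RDI, so point it at the source
  mov   rcx,-1      ; effectively unlimited count
  xor   al,al       ; search for the terminating zero byte
  repne scasb       ; RCX is decremented once per byte scanned, zero included
  neg   rcx
  sub   rcx,1       ; RCX = string length + 1 (the terminator is copied too)
  mov   rdi,rdx     ; restore the destination pointer
  rep   movsb       ; copy RCX bytes from [RSI] to [RDI]
  mov   rax,rdx     ; return the destination pointer
  ret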
    

is slower than the previous routine... I know REP MOVSB can take advantage of copying entire cache lines on Haswell+ processors, but REPNE SCASB has no such advantage...

My comments are, in a sense, meant to tell @int0x50 NOT to worry too much about using lots of "instructions", since good C compilers (which do a good, but not AMAZING, job) do it too...
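
The RCX arithmetic in the REPNE SCASB / REP MOVSB routine above is easy to get wrong by one, so here is a small C model of it. This is only an illustration: scan_counter, bytes_to_copy and strcpy_twopass are hypothetical names, and memcpy merely stands in for REP MOVSB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models REPNE SCASB with AL = 0 and RCX = -1: the counter is decremented
   once for every byte examined, including the terminating zero.
   Returns the final counter value. */
uint64_t scan_counter(const char *s)
{
    uint64_t rcx = UINT64_MAX;          /* mov rcx,-1 */
    assert(s != NULL);
    do {
        rcx--;                          /* one decrement per byte scanned */
    } while (*s++ != '\0');             /* stop once the zero byte was seen */
    return rcx;
}

/* Models neg rcx / sub rcx,1: the number of bytes to copy, zero included. */
uint64_t bytes_to_copy(uint64_t rcx)
{
    return (0 - rcx) - 1;
}

/* Two-pass strcpy built from the two helpers above. */
char *strcpy_twopass(char *dst, const char *src)
{
    uint64_t n = bytes_to_copy(scan_counter(src));
    memcpy(dst, src, (size_t)n);        /* stands in for rep movsb */
    return dst;
}
```

For "abc" the scan examines four bytes, so the counter ends at -5, and negating and subtracting one gives 4, which is the length plus the terminator. Since -x - 1 equals ~x in two's complement, a single NOT RCX would yield the same count as the NEG/SUB pair.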
Post 29 Jul 2019, 23:12
bitRAKE



Joined: 21 Jul 2003
Posts: 2795
Location: dank orb
The compiler explorer is a fun tool to play with ...
https://gcc.godbolt.org/z/nCam1B
Code:
strcpy(char*, char*):                          # @strcpy(char*, char*)
        mov     rax, rdi
        xor     ecx, ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movzx   edx, byte ptr [rsi + rcx]
        mov     byte ptr [rax + rcx], dl
        add     rcx, 1
        test    dl, dl
        jne     .LBB0_1
        ret    
int0x50, when we communicate, only a small subset of words is used to convey most ideas. Becoming aware of other words happens over time. Integrating those words into one's parlance takes more time. Neither of these can be expedited, nor are they a problem per se; study and use are sufficient.

It is wholly possible that the course of your work only needs a small subset of the instruction set. DimonSoft's recommendation of size coding is a good one: attempting to utilize every aspect of an instruction creates added familiarity, imho. Reading code produced by others is also educational. Maybe try a combination of the two: read the size-coded programs of others.

_________________
¯\(°_o)/¯ unlicense.org
Post 30 Jul 2019, 02:14
Roman



Joined: 21 Apr 2012
Posts: 503
Nice code, bitRAKE.
I want to show this code too.
Code:
;in: rax = destination string before calling strcopy
;in: rsi = source string before calling strcopy
strcopy:                                  
        sub     rcx, rcx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     dl, byte ptr [rsi + rcx]
        mov     byte ptr [rax + rcx], dl
        inc     rcx
        test    dl, dl
        jne     .LBB0_1
;now rcx holds the number of bytes copied (including the terminating zero)
        ret 
    


I don't understand: I always read about how well C++ compilers optimize code.
But in the real world, compiler optimization is neither that good nor perfect!
Post 30 Jul 2019, 08:51
bitRAKE



Joined: 21 Jul 2003
Posts: 2795
Location: dank orb
I just pasted from the compiler output. My version would go something like ...
Code:
macro strzcpy
  @@:   lodsb
        stosb
        test al,al
        jnz @B
end macro    
... it certainly depends on what one is doing. Do we need to support some external interface (the execution environment of the code, or the standards of a working group)?



Last edited by bitRAKE on 30 Jul 2019, 10:39; edited 1 time in total
Post 30 Jul 2019, 10:32
fpissarra



Joined: 10 Apr 2019
Posts: 63
@bitRAKE, this is a nice macro, indeed!
Post 30 Jul 2019, 10:38
Roman



Joined: 21 Apr 2012
Posts: 503
What about CLD or STD for LODSB?


Last edited by Roman on 30 Jul 2019, 14:11; edited 1 time in total
Post 30 Jul 2019, 11:22
bitRAKE



Joined: 21 Jul 2003
Posts: 2795
Location: dank orb
Most of the environments I work in default to CLD, but that is a good point.
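
For readers unsure why the direction flag matters here, the following is a toy C model of LODSB/STOSB and of the copy loop in the macro above. The struct and function names are invented for this sketch; the only hardware fact it relies on is that the string instructions advance their pointers by +1 when DF=0 (after CLD) and by -1 when DF=1 (after STD).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the registers that LODSB and STOSB use. */
struct string_regs {
    const unsigned char *si;   /* source pointer (RSI) */
    unsigned char *di;         /* destination pointer (RDI) */
    unsigned char al;          /* accumulator (AL) */
    int df;                    /* direction flag: 0 after CLD, 1 after STD */
};

/* LODSB: AL = [RSI], then RSI moves by +1 (DF=0) or -1 (DF=1). */
void lodsb(struct string_regs *r)
{
    assert(r != NULL);
    r->al = *r->si;
    r->si += r->df ? -1 : 1;
}

/* STOSB: [RDI] = AL, then RDI moves by +1 (DF=0) or -1 (DF=1). */
void stosb(struct string_regs *r)
{
    assert(r != NULL);
    *r->di = r->al;
    r->di += r->df ? -1 : 1;
}

/* The lodsb / stosb / test al,al / jnz loop.
   Returns the number of bytes copied, terminator included. */
size_t strzcpy_model(struct string_regs *r)
{
    size_t n = 0;
    do {
        lodsb(r);
        stosb(r);
        n++;
    } while (r->al != 0);
    return n;
}
```

With df set to 1 the same loop walks downward through memory until it meets a zero byte, which is why code that cannot be sure of the flag's state usually executes CLD before relying on the forward direction.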

I have always recommended the Hugi Size Coding Competition Series for those wanting to challenge themselves with size.
Post 30 Jul 2019, 11:29
st



Joined: 12 Jul 2019
Posts: 26
Location: Russia
There is one nuance about MOVS that one could miss: it moves data atomically. The destination data is always equal to the source data.

Let me explain this with an example of an automatic data allocator (a bit of a long read; one may skip #1).

1. There are a few preallocated bytes at the end of an executable image:
Code:
lea RDI, [heap]
stosq
;
stosq
;
heap rb INITIAL_SIZE    

Just put data at the address kept in RDI, then increment RDI. Got an exception while writing to an uncommitted address past this heap? Just catch it and map another RAM page in the exception handler.

This is simple but of course tends to eat up all available RAM, so it needs some garbage collection.

Every block that is in use has a pointer to it. These pointers (so-called roots) are stored somewhere. The exception handler now scans the pointers, finds the live blocks and moves them to the start of the heap (a kind of compacting GC). Then it corrects RDI
AND corrects the pointers to the moved blocks.

2. Suppose some block containing volatile data (roots) is being copied to the current allocation space:
Code:
movsq
movsq
; Out of heap exception (#PF) is generated.
; Source data is then changed by the exception handler.
; Then handler returns and failed instruction is re-executed.
; So modified source is moved to the destination.
    
With the 'RISC' approach (a separate load and store) it is not so easy to re-read the source.

I hope this gives you some ideas. However, please treat this as an assumption. Refer to SDM, Vol.1, Section 7.3.9.3 “Fast-String Operation” and Vol.3A, Section 8.2.4 “Fast-String Operation and Out-of-Order Stores” for additional information.
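
The restart argument in #2 can be sketched as a toy C model. Everything here is hypothetical (the struct, the function names, the idea of a handler that simply overwrites the source), and it models only one thing: a faulting instruction is restarted from its beginning, so a single MOVSQ reads the source again, while a separate load followed by a faulting store restarts only the store. It says nothing about what real hardware guarantees regarding atomicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy machine state; all names are made up for this sketch. */
struct machine {
    uint64_t *src;       /* models the quadword at [RSI] */
    uint64_t *dst;       /* models the quadword at [RDI] */
    uint64_t reg;        /* scratch register used by the load/store pair */
    bool dst_mapped;     /* false: a write to dst raises a page fault */
};

/* One MOVSQ-like instruction: on a fault nothing is committed,
   so re-executing it reads the source again. Returns false on fault. */
bool movsq_model(struct machine *m)
{
    uint64_t value = *m->src;
    if (!m->dst_mapped)
        return false;
    *m->dst = value;
    return true;
}

/* 'RISC' style: the load is a separate instruction that has already retired. */
void load_model(struct machine *m)
{
    m->reg = *m->src;
}

bool store_model(struct machine *m)
{
    if (!m->dst_mapped)
        return false;
    *m->dst = m->reg;
    return true;
}

/* Fault handler: maps the page and changes the source data. */
void handler_model(struct machine *m, uint64_t new_value)
{
    *m->src = new_value;
    m->dst_mapped = true;
}

/* Runs the single-instruction copy; returns what ends up in dst. */
uint64_t run_movs(uint64_t old_value, uint64_t new_value)
{
    uint64_t src = old_value, dst = 0;
    struct machine m = { &src, &dst, 0, false };
    while (!movsq_model(&m))
        handler_model(&m, new_value);
    return dst;
}

/* Runs the load/store pair; only the faulting store is re-executed. */
uint64_t run_load_store(uint64_t old_value, uint64_t new_value)
{
    uint64_t src = old_value, dst = 0;
    struct machine m = { &src, &dst, 0, false };
    load_model(&m);
    while (!store_model(&m))
        handler_model(&m, new_value);
    return dst;
}
```

In this model run_movs(1, 2) ends with the updated value in the destination, while run_load_store(1, 2) stores the stale value that was loaded before the fault.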
Post 31 Jul 2019, 09:02
Copyright © 1999-2019, Tomasz Grysztar.