flat assembler
Message board for the users of flat assembler.

flat assembler > Main > FASM vs C++ Optimization

Goto page 1, 2  Next
Author
Thread Post new topic Reply to topic
marywilliam



Joined: 07 Jun 2018
Posts: 10
So I ran a simple test to measure the processor cycles of a C++ function versus it's inline FASM equivalent. To my surprise, the FASM version was about 7 times faster. Are current compilers still slower? Or maybe there's something wrong in my code or I am misunderstanding the results. Could someone run this and verify the results?

I am adding two arrays of length 2000, and I am doing that operation 100000 times. This is my code:

C++

void calcuC(int *x,int *y,int length)
{
for(int i = 0; i < TIMES; i++)
{
for(int j = 0; j < length; j++)
x[j] += y[j];
}
}

fasm
Code:
;void calcuAsm(int x[],int y[],int length);
public calcuAsm

calcuAsm:
   mov ebx,100000
start:
    mov ecx,edx ;2000=length of array
    shr ecx,1
    lea r8, [esi]
    lea r9, [edi]
@@:
    movq mm0,[r8]
    paddd mm0,[r9]
    add r9,8
    movq [r8],mm0   
    add r8,8
    dec ecx
    test ecx,ecx
    jnz @b
    dec ebx
    test ebx,ebx
    jnz start
    


For my fasm function, I made an object file and linked it with my C++ project. That allows me to call both the native C++ and FASM functions within my C++ project. My current work is in C++, so I wanted to test how well a fasm implemented function would perform.

The execution results in microseconds are:
FASM
91709
92247
93757
93253
90603
Average = 92313.8

C++
635412
623639
639946
643908
626366
Average = 633854.2

I have attached all the source files and the Makefile. Are these results typical? Looking for the experts to chime in.
Thanks.


Description:
Download
Filename: asmtest.zip
Filesize: 3.9 KB
Downloaded: 64 Time(s)



Last edited by marywilliam on 22 Aug 2019, 23:33; edited 1 time in total
Post 22 Aug 2019, 23:06
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16779
Location: In your JS exploiting you and your system
What is the calling convention used?

[edit]Now that the forum software is fixed the comment that was here makes no sense anymore.


Last edited by revolution on 23 Aug 2019, 08:41; edited 1 time in total
Post 22 Aug 2019, 23:22
View user's profile Send private message Visit poster's website Reply with quote
marywilliam



Joined: 07 Jun 2018
Posts: 10
I am using x86-64 GCC calling convention (ABI 0.98 ). The code tags made the C++ code above look weird. I have edited my original post. The source files are clearer to understand. You can see how I call both my C++ and fasm functions and find their execution times.
Post 22 Aug 2019, 23:34
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16779
Location: In your JS exploiting you and your system
Oh, I see that the forum is broken. It is eating the square brackets in code sections.
Post 23 Aug 2019, 00:35
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16779
Location: In your JS exploiting you and your system
One thing you can compare is the disassembled output of the C function. Or if your compiler has the option, look at the pre-assembled output (preferably in Intel format (not AT&T format)).
Post 23 Aug 2019, 00:49
View user's profile Send private message Visit poster's website Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 70
This is inner loop from clang:
Code:
.LBB0_5:                                #   Parent Loop BB0_2 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
    vmovdqu (%rdi,%rcx,4), %ymm0
    vmovdqu 32(%rdi,%rcx,4), %ymm1
    vmovdqu 64(%rdi,%rcx,4), %ymm2
    vmovdqu 96(%rdi,%rcx,4), %ymm3
    vpaddd  (%rsi,%rcx,4), %ymm0, %ymm0
    vpaddd  32(%rsi,%rcx,4), %ymm1, %ymm1
    vpaddd  64(%rsi,%rcx,4), %ymm2, %ymm2
    vpaddd  96(%rsi,%rcx,4), %ymm3, %ymm3
    vmovdqu %ymm0, (%rdi,%rcx,4)
    vmovdqu %ymm1, 32(%rdi,%rcx,4)
    vmovdqu %ymm2, 64(%rdi,%rcx,4)
    vmovdqu %ymm3, 96(%rdi,%rcx,4)
    addq    $32, %rcx
    cmpq    %rcx, %r10
    jne .LBB0_5
    


this is gcc:
Code:
.L4:
    vmovups (%rax,%rdx), %xmm1
    vpaddd  (%rcx,%rdx), %xmm1, %xmm0
    vmovups %xmm0, (%rax,%rdx)
    addq    $16, %rdx
    cmpq    %rsi, %rdx
    jne .L4
    

AMD Zen+

edit code section eats ()
Post 23 Aug 2019, 01:10
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 70
btw in your makefile, you are compiling without optimisations, no wonder C code is slow ; )
Post 23 Aug 2019, 01:22
View user's profile Send private message Reply with quote
marywilliam



Joined: 07 Jun 2018
Posts: 10
Melissa wrote:
btw in your makefile, you are compiling without optimisations, no wonder C code is slow ; )


What type of performance do you get when you optimize the C code? I'm still surprised to see such differences in the standard compilation.
Post 23 Aug 2019, 01:37
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 70
there is no standard compilation. Without -O1-3 or -Ofast compiler produces code usefull only for debugging algorithm.

edit:
also you would use -march=native as without it it will only produce generic x86 code
Post 23 Aug 2019, 01:50
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1430
It is always funny and sad at the same time, to see people compile without optimizations when caring about the output (performance, size, whatever).

Seriously, why is this such a common thing? stackoverflow is full of this kind of "experiences".
Post 24 Aug 2019, 13:19
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 70
Also ams routine is not correct. It shuld preserve rbx, use 64 bit registers for addresses and ret is missing.
Post 06 Sep 2019, 05:41
View user's profile Send private message Reply with quote
guignol



Joined: 06 Dec 2008
Posts: 619
Location: I mean good, I know better, only us will do best
there is no need to argue about compilers, unless you have their source codes

and tbe effectiveness of the compiler depends on the complexity of the language itself
this is obvious


and C is a classic, which is the best compiler?
Post 06 Sep 2019, 07:17
View user's profile Send private message Reply with quote
fpissarra



Joined: 10 Apr 2019
Posts: 63
Counting cycles, with -O2 optimizations, not using MMX/SSE on ASM block:
Code:
/* test.c */
/* Compile:

   $ gcc -O2 -mtune=native -fno-stack-protector -o test test.c

   Notice I used -mtune just to optimize for my processor,
   but using 'generic' instruction set. The -fno-stack-protector
   is there just to avoid the compiler inserting unecessary code.

   You can try to change -O2 to -O3 and use -ftree-vectorize to use SSE, fully, in C.

   Of couse, use -S if you want to see the asm listing.
*/

#include <stdio.h>
#include <string.h>
#include <inttypes.h>

/* Flush all read/writes before cycle counting... */
#define SYNC_MEM

#include "cycle_counting.h"

/* Functions will be called, not inlined */
#define NOINLINE __attribute__((noinline))

/* I'm not using dynamic allocation */
#define ARRAY_SIZE  20000

NOINLINE static void calcAsm( int *, int *, size_t );
NOINLINE static void calcC( int *, int *, size_t );
static void fillArrays( int *, int *, size_t );

int main( void )
{
  /* static, because I don't want to allocate on stack! */
  static int x[ARRAY_SIZE], y[ARRAY_SIZE];
  counter_T c1, c2;

  fillArrays( x, y, sizeof x );
  c1 = BEGIN_TSC();
    calcC( x, y, ARRAY_SIZE );
  END_TSC( &c1 );

  fillArrays( x, y, sizeof x );
  c2 = BEGIN_TSC();
    calcAsm( x, y, ARRAY_SIZE );
  END_TSC( &c2 );

  printf( "Cycles:\n"
          "  calcAsm: %" PRIu64 "\n"
          "  calcC:   %" PRIu64 "\n",
          c2, c1 );
}

/* Filling the arrays the same way you do. */
void fillArrays( int *px, int *py, size_t size )
{
  size_t elements = size / sizeof * py;

  memset( px, 0, size );

  while ( elements-- )
    *py++ = 1;
}

void calcC( int *px, int *py, size_t elements )
{
  while ( elements-- )
    *px++ += *py++;
}

/* Sorry... I don't like fasm, so here's the inline assembly function. */
void calcAsm( int *px, int *py, size_t elements )
{
  /* This is the same thing you did, in asm... almost. 
     Without using SSE. */
  __asm__ __volatile__(
    "   testq %%rdx,%%rdx\n\t"
    "   jz    2f\n\t"
    "1:\n\t"
    "   movl  (%%rdi),%%eax\n\t"
    "   addl  (%%rsi),%%eax\n\t"
    "   movl  %%eax,(%%rdi)\n\t"
    "   addq  $4,%%rdi\n\t"
    "   addq  $4,%%rsi\n\t"
    "   subq  $1,%%rdx\n\t"
    "   jnz   1b\n\t"
    "2:\n"
    :
    : "D"( px ), "S"( py ), "d"( elements )
    : "memory"
  );
}    


My cycles_counting.h header:
Code:
#ifndef CYCLE_COUNTING_INCLUDED__
#define CYCLE_COUNTING_INCLUDED__

/* FIXME: Add CLANG test! */
#ifndef __GNUC__
# error Works only on GCC
#endif

/* ==========================================
    Quick & Dirty cycle counting...
   ========================================== */
#include <stdint.h>

typedef volatile uint64_t counter_T;

#if defined(__x86_64__)
  inline counter_T BEGIN_TSC( void )
  {
    uint32_t a, d;

    __asm__ __volatile__ (
# ifdef SYNC_MEM
      "mfence\n\t"
# endif
      "xorl %%eax,%%eax\n\t"
      "cpuid\n\t"
      "rdtsc" : "=a" (a), "=d" (d) :: "%rbx", "%rcx"
    );

    return a | ((uint64_t)d << 32);
  }

  inline void END_TSC( counter_T *cptr )
  {
    uint32_t a, d;

    __asm__ __volatile__ (
      "rdtscp" : "=a" (a), "=d" (d) :: "%rcx"
    );

    *cptr = (a | ((uint64_t)d << 32)) - *cptr;;
  }
#elif defined(__i386__)
  inline counter_T BEGIN_TSC( void )
  {
    uint32_t a, d;

    __asm__ __volatile__ (
# ifdef __SSE2__
#   ifdef SYNC_MEM
      "mfence\n\t"
#   endif
# endif
      "xorl %%eax,%%eax\n\t"
      "cpuid\n\t"
      "rdtsc" : "=a" (a), "=d" (d) :: "%ebx", "%ecx"
    );

    return a | ((uint64_t)d << 32);
  }

  inline void END_TSC( counter_T *cptr )
  {
    uint32_t a, d;

    __asm__ __volatile__ (
      "rdtscp" : "=a" (a), "=d" (d) :: "%ecx"
    );

    *cptr = (a | ((uint64_t)d << 32)) - *cptr;
  }
#elif defined(__aarch64__)
  inline counter_T BEGIN_TSC( void )
  {
    uint64_t count;

    __asm__ __volatile__ ( 
# ifdef SYNC_MEM
    "dmb\n\t"
# endif
    "mrs %0,cntvct_el0" : "=r" (count) );

    return count;
  }

  inline void END_TSC( counter_T *cptr )
  {
    uint64_t count;

    __asm__ __volatile__ ( "mrs %0,cntvct_el0" : "=r" (count) );

    *cptr = count - *cptr;
  }
#else
# error i386, x86-64 and AArch64 only.
#endif

#endif    


Here's the test:

Code:
$ gcc -O2 -mtune=native -o test test.c
$ ./test
Cycles:
  calcAsm: 344207
  calcC:   58843    

Yep... calcAsm is 6 times SLOWER than the calcC![/code]
Post 06 Sep 2019, 09:43
View user's profile Send private message Reply with quote
st



Joined: 12 Jul 2019
Posts: 26
Location: Russia
fpissarra wrote:

Code:
  /* static, because I don't want to allocate on stack! */
  static int x[ARRAY_SIZE], y[ARRAY_SIZE];
    
I guess GCC did all calculation in the compile-time.
Post 06 Sep 2019, 09:53
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16779
Location: In your JS exploiting you and your system
fpissarra wrote:
Yep... calcAsm is 6 times SLOWER than the calcC!
Certainly possible.

But you might want to try inverting the call order. The array might need to "warm up" the cache or something, and that time would be included in the timing for the first called function.

Also a system interrupt in the middle of one function call can affect the times you see. Try running the test many times over for each call, and display the shortest time.
Post 06 Sep 2019, 10:02
View user's profile Send private message Visit poster's website Reply with quote
fpissarra



Joined: 10 Apr 2019
Posts: 63
revolution wrote:
But you might want to try inverting the call order. The array might need to "warm up" the cache or something, and that time would be included in the timing for the first called function.

That's why I use mfence and cpuid serialization... But, you are right testing both ways... inverting the calls gave me this:
Code:
$ ./test
Cycles:
  calcAsm: 64168
  calcC:   56762    

Still... calcAsm is 13% slower than calcC.

revolution wrote:
Also a system interrupt in the middle of one function call can affect the times you see. Try running the test many times over for each call, and display the shortest time.

Not only interrupts, but task switching, DMA, page faults, etc... This is always a problem.
Post 06 Sep 2019, 10:24
View user's profile Send private message Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 16779
Location: In your JS exploiting you and your system
fpissarra wrote:
Not only interrupts, but task switching, DMA, page faults, etc... This is always a problem.
Yeah. Programs don't run in isolation. They run on a system with many other things happening.
Post 06 Sep 2019, 10:50
View user's profile Send private message Visit poster's website Reply with quote
fpissarra



Joined: 10 Apr 2019
Posts: 63
To be fair, since the original code uses MMX, here's a good test using SSE, in both C and ASM functions:
Code:
#include <stddef.h>
#include <x86intrin.h>

/* Using SSE */
void calcC ( int *px, int *py, size_t elements )
{
  size_t blocks = elements >> 2;    // 4 integers blocks counter.
  size_t remaining = elements & 3;  // if elements isn't multiple of 4.
  __m128i *p1, *p2;

  p1 = (__m128i *)px;
  p2 = (__m128i *)py;
  while ( blocks-- )
  {
    /* Explicitly using intrinsic function.
       These 2 lines could be changed to a single one:

       *p1++ += *p2++; 

       But I want to be sure if 'int's are being modified!
    */
    *p1 = _mm_add_epi32( *p1, *p2 );
    p1++; p2++;
  }

  /* Change remaining integers, if any */
  px = (int *)p1;
  py = (int *)p2;
  while ( remaining-- )
    *px++ += *py++;
}

/* The same routine, handmade, using SSE. */
void calcAsm ( int *px, int *py, size_t elements )
{
  __asm__ __volatile__ (
      "  movq  %%rdx,%%r8\n\t"
      "  shrq  $2,%%r8\n\t"     /* R8 = # of 4 integers blocks... */
      "  andl  $3,%%edx\n\t"    /* EDX = remaining integers */

      "  testq %%r8,%%r8\n\t"
      "  jz    2f\n\t"
      "1:\n\t"
      "  movdqa (%%rdi),%%xmm0\n\t"
      "  paddd  (%%rsi),%%xmm0\n\t"
      "  movaps %%xmm0,(%%rdi)\n\t"
      "  addq   $16,%%rdi\n\t"
      "  addq   $16,%%rsi\n\t"
      "  subq   $1,%%r8\n\t"
      "  jnz    1b\n\t"
      "2:\n\t"
      "  testl  %%edx,%%edx\n\t"  /* If there are still integers on array... */
      "  jz     4f\n\t"
      "3:\n\t"
      "  movl   (%%rdi),%%eax\n\t"
      "  addl   (%%rsi),%%eax\n\t"
      "  movl   %%eax,(%%rdi)\n\t"
      "  addq   $4,%%rdi\n\t"
      "  addq   $4,%%rsi\n\t"
      "  subl   $1,%%edx\n\t"
      "  jnz    3b\n\t"
      "4:\n\t"
    :
    : "D" (px), "S" (py), "d" (elements)
    : "r8", "memory"
  );
}    

My test, this time using an i5-3570:
Code:
$ ./test
Cycles:
  calcAsm: 38064
  calcC:   31628    

calcAsm is, still, slower (20% slower!)...

My point: It doesn't mean a routine is created in assembly that it is, automatically, faster than what a good routine, and a good compiler, can create... In the original test the comparison was between a non optimized C function and a good asm routine (not fair!)...

As @revolution points out. There are lots of other things to consider when measuring performance (and creating an awesome routine that takes into account things as caches usage, macro and microfusion of instructions, etc, etc, etc...)...
Post 06 Sep 2019, 12:20
View user's profile Send private message Reply with quote
Melissa



Joined: 12 Apr 2012
Posts: 70
Code:
~/.../examples/test >>> g++ -Ofast -std=c++17 -march=native test.cpp  -o testcpp -pthread -lstdc++           
~/.../examples/test >>> ./testcpp                                                                            
Cycles:
  calcAsm: 19872
  calcC:   13176
~/.../examples/test >>> g++ -O2 -std=c++17 -march=native test.cpp  -o testcpp -pthread -lstdc++              
~/.../examples/test >>> ./testcpp                                                                            
Cycles:
  calcAsm: 20052
  calcC:   53496
~/.../examples/test >>> clang -O2 -std=c++17 -march=native test.cpp  -o testcpp -pthread -lstdc++            
~/.../examples/test >>> ./testcpp                                                                            
Cycles:
  calcAsm: 14616
  calcC:   11808
~/.../examples/test >>> clang -Ofast -std=c++17 -march=native test.cpp  -o testcpp -pthread -lstdc++         
~/.../examples/test >>> ./testcpp                                                                            
Cycles:
  calcAsm: 14220
  calcC:   11844
    


clang uses inline asm differently.
Post 06 Sep 2019, 18:20
View user's profile Send private message Reply with quote
Furs



Joined: 04 Mar 2016
Posts: 1430
Here's what godbolt spit out:
Code:
calcC(int*, int*, unsigned long):
        mov     r9, rdx
        mov     rcx, rsi
        and     edx, 3
        mov     rax, rdi
        shr     r9, 2
        lea     r8, [r9-1]
        test    r9, r9
        je      .L5
.L2:
        movdqa  xmm0, XMMWORD PTR [rax]
        paddd   xmm0, XMMWORD PTR [rcx]
        sub     r8, 1
        add     rax, 16
        add     rcx, 16
        movaps  XMMWORD PTR [rax-16], xmm0
        cmp     r8, -1
        jne     .L2
        sal     r9, 4
        add     rdi, r9
        add     rsi, r9
.L5:
        test    rdx, rdx
        je      .L16
        mov     eax, DWORD PTR [rsi]
        add     DWORD PTR [rdi], eax
        cmp     rdx, 1
        je      .L1
        mov     eax, DWORD PTR [rsi+4]
        add     DWORD PTR [rdi+4], eax
        cmp     rdx, 2
        je      .L1
        mov     eax, DWORD PTR [rsi+8]
        add     DWORD PTR [rdi+8], eax
.L1:
        ret
.L16:
        ret    
Which seems almost identical to your asm. Something is definitely fishy here.
Post 08 Sep 2019, 15:02
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page 1, 2  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2019, Tomasz Grysztar.

Powered by rwasa.