flat assembler
Message board for the users of flat assembler.
FASM vs C++ Optimization
marywilliam 22 Aug 2019, 23:06
So I ran a simple test to measure the processor cycles of a C++ function versus its FASM equivalent. To my surprise, the FASM version was about 7 times faster. Are current compilers really still that much slower? Or is there something wrong in my code, or am I misunderstanding the results? Could someone run this and verify the results?
I am adding two arrays of length 2000, and I am doing that operation 100000 times. This is my code:

C++:
Code:
void calcuC(int *x, int *y, int length)
{
    for (int i = 0; i < TIMES; i++)
    {
        for (int j = 0; j < length; j++)
            x[j] += y[j];
    }
}

fasm:
Code:
;void calcuAsm(int x[],int y[],int length);
public calcuAsm
calcuAsm:
        mov     ebx,100000
start:
        mov     ecx,edx         ;2000 = length of array
        shr     ecx,1
        lea     r8,[esi]
        lea     r9,[edi]
@@:
        movq    mm0,[r8]
        paddd   mm0,[r9]
        add     r9,8
        movq    [r8],mm0
        add     r8,8
        dec     ecx
        test    ecx,ecx
        jnz     @b
        dec     ebx
        test    ebx,ebx
        jnz     start

For my fasm function, I made an object file and linked it with my C++ project. That allows me to call both the native C++ and FASM functions from the same program. My current work is in C++, so I wanted to test how well a fasm-implemented function would perform.

The execution times in microseconds are:

Code:
FASM: 91709  92247  93757  93253  90603   Average = 92313.8
C++:  635412 623639 639946 643908 626366  Average = 633854.2

I have attached all the source files and the Makefile. Are these results typical? Looking for the experts to chime in. Thanks.
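For readers without the attachment, a minimal harness along these lines matches the setup described above. The extern "C" declaration, the fill values and the std::chrono timing are illustrative guesses, not necessarily how the attached sources and Makefile actually do it.

Code:
// Sketch only -- not the attached project. Assumes the fasm routine above is
// assembled to an object file and linked in, and that times are taken with
// std::chrono; the real sources may differ.
#include <chrono>
#include <cstdio>

#define TIMES  100000
#define LENGTH 2000

extern "C" void calcuAsm(int *x, int *y, int length);  // from the fasm object file

void calcuC(int *x, int *y, int length)
{
    for (int i = 0; i < TIMES; i++)
        for (int j = 0; j < length; j++)
            x[j] += y[j];
}

int main()
{
    static int x[LENGTH], y[LENGTH];            // zero-initialised
    for (int i = 0; i < LENGTH; i++) y[i] = 1;  // arbitrary fill for the example

    auto t0 = std::chrono::steady_clock::now();
    calcuC(x, y, LENGTH);
    auto t1 = std::chrono::steady_clock::now();

    auto t2 = std::chrono::steady_clock::now();
    calcuAsm(x, y, LENGTH);
    auto t3 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("C++ : %lld us\n", (long long)std::chrono::duration_cast<us>(t1 - t0).count());
    std::printf("FASM: %lld us\n", (long long)std::chrono::duration_cast<us>(t3 - t2).count());
}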
Last edited by marywilliam on 22 Aug 2019, 23:33; edited 1 time in total
marywilliam 22 Aug 2019, 23:34
I am using the x86-64 GCC calling convention (ABI 0.98). The code tags made the C++ code above look weird, so I have edited my original post. The source files are clearer to understand. You can see how I call both my C++ and fasm functions and find their execution times.
revolution 23 Aug 2019, 00:35
Oh, I see that the forum is broken. It is eating the square brackets in code sections.
revolution 23 Aug 2019, 00:49
One thing you can compare is the disassembled output of the C function. Or, if your compiler has the option, look at the assembly listing the compiler itself generates before assembling (preferably in Intel format, not AT&T).
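For example, with GCC it would be something like this (the file names here are just placeholders):

Code:
$ gcc -O2 -S -masm=intel calcu.c        # compiler's assembly output, Intel syntax
$ objdump -d -M intel calcu.o           # disassembly of the compiled object file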
Melissa 23 Aug 2019, 01:10
This is the inner loop from clang:

Code:
.LBB0_5:                                # Parent Loop BB0_2 Depth=1
                                        # => This Inner Loop Header: Depth=2
        vmovdqu (%rdi,%rcx,4), %ymm0
        vmovdqu 32(%rdi,%rcx,4), %ymm1
        vmovdqu 64(%rdi,%rcx,4), %ymm2
        vmovdqu 96(%rdi,%rcx,4), %ymm3
        vpaddd  (%rsi,%rcx,4), %ymm0, %ymm0
        vpaddd  32(%rsi,%rcx,4), %ymm1, %ymm1
        vpaddd  64(%rsi,%rcx,4), %ymm2, %ymm2
        vpaddd  96(%rsi,%rcx,4), %ymm3, %ymm3
        vmovdqu %ymm0, (%rdi,%rcx,4)
        vmovdqu %ymm1, 32(%rdi,%rcx,4)
        vmovdqu %ymm2, 64(%rdi,%rcx,4)
        vmovdqu %ymm3, 96(%rdi,%rcx,4)
        addq    $32, %rcx
        cmpq    %rcx, %r10
        jne     .LBB0_5

and this is gcc:

Code:
.L4:
        vmovups (%rax,%rdx), %xmm1
        vpaddd  (%rcx,%rdx), %xmm1, %xmm0
        vmovups %xmm0, (%rax,%rdx)
        addq    $16, %rdx
        cmpq    %rsi, %rdx
        jne     .L4

AMD Zen+

edit: code section eats ()
Melissa 23 Aug 2019, 01:22
btw, in your makefile you are compiling without optimisations, no wonder the C code is slow ;)
marywilliam 23 Aug 2019, 01:37
Melissa wrote:
btw, in your makefile you are compiling without optimisations, no wonder the C code is slow ;)

What kind of performance do you get when you optimize the C code? I'm still surprised to see such a difference with the standard compilation.
Melissa 23 Aug 2019, 01:50
There is no standard compilation. Without -O1/-O2/-O3 or -Ofast the compiler produces code useful only for debugging the algorithm.
edit: also, you should use -march=native, as without it the compiler will only produce generic x86 code.
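For example, something like this on the command line or in the makefile (the file name is a placeholder, not the actual attached source):

Code:
$ g++ -O2 -march=native -c calcuC.cpp   # or -O3 / -Ofast instead of -O2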
Furs 24 Aug 2019, 13:19
It is always funny and sad at the same time to see people compile without optimizations when they care about the output (performance, size, whatever).
Seriously, why is this such a common thing? stackoverflow is full of these kinds of "experiences".
Melissa 06 Sep 2019, 05:41
Also, the asm routine is not correct. It should preserve rbx (which is callee-saved in the x86-64 System V ABI), it should use the 64-bit argument registers rdi/rsi for the addresses rather than esi/edi, and the ret is missing.
guignol 06 Sep 2019, 07:17
there is no need to argue about compilers unless you have their source code,
and the effectiveness of a compiler depends on the complexity of the language itself. This is obvious, and C is a classic. Which is the best compiler?
fpissarra 06 Sep 2019, 09:43
Counting cycles, with -O2 optimizations, not using MMX/SSE in the ASM block:
Code:
/* test.c */

/* Compile:

     $ gcc -O2 -mtune=native -fno-stack-protector -o test test.c

   Notice I used -mtune just to optimize for my processor, but using the
   'generic' instruction set. The -fno-stack-protector is there just to avoid
   the compiler inserting unnecessary code.

   You can try to change -O2 to -O3 and use -ftree-vectorize to use SSE,
   fully, in C. Of course, use -S if you want to see the asm listing. */
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

/* Flush all read/writes before cycle counting... */
#define SYNC_MEM
#include "cycle_counting.h"

/* Functions will be called, not inlined */
#define NOINLINE __attribute__((noinline))

/* I'm not using dynamic allocation */
#define ARRAY_SIZE 20000

NOINLINE static void calcAsm( int *, int *, size_t );
NOINLINE static void calcC( int *, int *, size_t );
static void fillArrays( int *, int *, size_t );

int main( void )
{
  /* static, because I don't want to allocate on stack! */
  static int x[ARRAY_SIZE], y[ARRAY_SIZE];
  counter_T c1, c2;

  fillArrays( x, y, sizeof x );
  c1 = BEGIN_TSC();
  calcC( x, y, ARRAY_SIZE );
  END_TSC( &c1 );

  fillArrays( x, y, sizeof x );
  c2 = BEGIN_TSC();
  calcAsm( x, y, ARRAY_SIZE );
  END_TSC( &c2 );

  printf( "Cycles:\n"
          " calcAsm: %" PRIu64 "\n"
          " calcC: %" PRIu64 "\n", c2, c1 );
}

/* Filling the arrays the same way you do. */
void fillArrays( int *px, int *py, size_t size )
{
  size_t elements = size / sizeof *py;

  memset( px, 0, size );
  while ( elements-- )
    *py++ = 1;
}

void calcC( int *px, int *py, size_t elements )
{
  while ( elements-- )
    *px++ += *py++;
}

/* Sorry... I don't like fasm, so here's the inline assembly function. */
void calcAsm( int *px, int *py, size_t elements )
{
  /* This is the same thing you did, in asm... almost. Without using SSE. */
  __asm__ __volatile__(
    "   testq %%rdx,%%rdx\n\t"
    "   jz    2f\n\t"
    "1:\n\t"
    "   movl  (%%rdi),%%eax\n\t"
    "   addl  (%%rsi),%%eax\n\t"
    "   movl  %%eax,(%%rdi)\n\t"
    "   addq  $4,%%rdi\n\t"
    "   addq  $4,%%rsi\n\t"
    "   subq  $1,%%rdx\n\t"
    "   jnz   1b\n\t"
    "2:\n"
    : : "D"( px ), "S"( py ), "d"( elements ) : "memory" );
}

My cycle_counting.h header:

Code:
#ifndef CYCLE_COUNTING_INCLUDED__
#define CYCLE_COUNTING_INCLUDED__

/* FIXME: Add CLANG test! */
#ifndef __GNUC__
# error Works only on GCC
#endif

/* ==========================================
   Quick & Dirty cycle counting...
   ========================================== */
#include <stdint.h>

typedef volatile uint64_t counter_T;

#if defined(__x86_64__)
inline counter_T BEGIN_TSC( void )
{
  uint32_t a, d;

  __asm__ __volatile__ (
# ifdef SYNC_MEM
    "mfence\n\t"
# endif
    "xorl %%eax,%%eax\n\t"
    "cpuid\n\t"
    "rdtsc"
    : "=a" (a), "=d" (d) :: "%rbx", "%rcx" );

  return a | ((uint64_t)d << 32);
}

inline void END_TSC( counter_T *cptr )
{
  uint32_t a, d;

  __asm__ __volatile__ (
    "rdtscp"
    : "=a" (a), "=d" (d) :: "%rcx" );

  *cptr = (a | ((uint64_t)d << 32)) - *cptr;
}
#elif defined(__i386__)
inline counter_T BEGIN_TSC( void )
{
  uint32_t a, d;

  __asm__ __volatile__ (
# ifdef __SSE2__
#  ifdef SYNC_MEM
    "mfence\n\t"
#  endif
# endif
    "xorl %%eax,%%eax\n\t"
    "cpuid\n\t"
    "rdtsc"
    : "=a" (a), "=d" (d) :: "%ebx", "%ecx" );

  return a | ((uint64_t)d << 32);
}

inline void END_TSC( counter_T *cptr )
{
  uint32_t a, d;

  __asm__ __volatile__ (
    "rdtscp"
    : "=a" (a), "=d" (d) :: "%ecx" );

  *cptr = (a | ((uint64_t)d << 32)) - *cptr;
}
#elif defined(__aarch64__)
inline counter_T BEGIN_TSC( void )
{
  uint64_t count;

  __asm__ __volatile__ (
# ifdef SYNC_MEM
    "dmb\n\t"
# endif
    "mrs %0,cntvct_el0"
    : "=r" (count) );

  return count;
}

inline void END_TSC( counter_T *cptr )
{
  uint64_t count;

  __asm__ __volatile__ (
    "mrs %0,cntvct_el0"
    : "=r" (count) );

  *cptr = count - *cptr;
}
#else
# error i386, x86-64 and AArch64 only.
#endif

#endif

Here's the test:

Code:
$ gcc -O2 -mtune=native -o test test.c
$ ./test
Cycles:
 calcAsm: 344207
 calcC: 58843

Yep... calcAsm is 6 times SLOWER than calcC!
st 06 Sep 2019, 09:53
fpissarra wrote:
revolution 06 Sep 2019, 10:02
fpissarra wrote:
Yep... calcAsm is 6 times SLOWER than calcC!

But you might want to try inverting the call order. The array might need to "warm up" the cache or something, and that time would be included in the timing for the first called function. Also a system interrupt in the middle of one function call can affect the times you see. Try running the test many times over for each call, and display the shortest time.
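A minimal sketch of that suggestion, assuming it is dropped into the test.c posted above (it reuses fillArrays and the BEGIN_TSC/END_TSC helpers from cycle_counting.h; time_min and RUNS are made-up names):

Code:
/* Repeat each measurement and keep the smallest cycle count: the shortest
   run is the one least disturbed by interrupts, task switches, etc. */
#define RUNS 101

static counter_T time_min( void (*calc)( int *, int *, size_t ),
                           int *px, int *py, size_t elements )
{
  counter_T best = ~(uint64_t)0, c;

  for ( unsigned i = 0; i < RUNS; i++ )
  {
    fillArrays( px, py, elements * sizeof ( int ) );  /* same input every run */

    c = BEGIN_TSC();
    calc( px, py, elements );
    END_TSC( &c );

    if ( c < best )
      best = c;
  }

  return best;
}

/* In main(), instead of the single-shot measurements:
     c1 = time_min( calcC, x, y, ARRAY_SIZE );
     c2 = time_min( calcAsm, x, y, ARRAY_SIZE );  */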
fpissarra 06 Sep 2019, 10:24
revolution wrote:
But you might want to try inverting the call order. The array might need to "warm up" the cache or something, and that time would be included in the timing for the first called function.

That's why I use mfence and cpuid serialization... But you are right, it is worth testing both ways. Inverting the calls gave me this:

Code:
$ ./test
Cycles:
 calcAsm: 64168
 calcC: 56762

Still... calcAsm is 13% slower than calcC.

revolution wrote:
Also a system interrupt in the middle of one function call can affect the times you see. Try running the test many times over for each call, and display the shortest time.

Not only interrupts, but task switching, DMA, page faults, etc... This is always a problem.
revolution 06 Sep 2019, 10:50
fpissarra wrote:
Not only interrupts, but task switching, DMA, page faults, etc... This is always a problem.
fpissarra 06 Sep 2019, 12:20
To be fair, since the original code uses MMX, here's a good test using SSE, in both C and ASM functions:
Code:
#include <stddef.h>
#include <x86intrin.h>

/* Using SSE */
void calcC ( int *px, int *py, size_t elements )
{
  size_t blocks = elements >> 2;     // 4-integer blocks counter.
  size_t remaining = elements & 3;   // if elements isn't a multiple of 4.
  __m128i *p1, *p2;

  p1 = (__m128i *)px;
  p2 = (__m128i *)py;

  while ( blocks-- )
  {
    /* Explicitly using the intrinsic function. These 2 lines could be
       changed to a single one:

         *p1++ += *p2++;

       But I want to be sure the 'int's are being modified! */
    *p1 = _mm_add_epi32( *p1, *p2 );
    p1++; p2++;
  }

  /* Change remaining integers, if any */
  px = (int *)p1;
  py = (int *)p2;
  while ( remaining-- )
    *px++ += *py++;
}

/* The same routine, handmade, using SSE. */
void calcAsm ( int *px, int *py, size_t elements )
{
  __asm__ __volatile__ (
    "   movq   %%rdx,%%r8\n\t"
    "   shrq   $2,%%r8\n\t"          /* R8 = # of 4-integer blocks... */
    "   andl   $3,%%edx\n\t"         /* EDX = remaining integers */
    "   testq  %%r8,%%r8\n\t"
    "   jz     2f\n\t"
    "1:\n\t"
    "   movdqa (%%rdi),%%xmm0\n\t"
    "   paddd  (%%rsi),%%xmm0\n\t"
    "   movaps %%xmm0,(%%rdi)\n\t"
    "   addq   $16,%%rdi\n\t"
    "   addq   $16,%%rsi\n\t"
    "   subq   $1,%%r8\n\t"
    "   jnz    1b\n\t"
    "2:\n\t"
    "   testl  %%edx,%%edx\n\t"      /* If there are still integers in the array... */
    "   jz     4f\n\t"
    "3:\n\t"
    "   movl   (%%rdi),%%eax\n\t"
    "   addl   (%%rsi),%%eax\n\t"
    "   movl   %%eax,(%%rdi)\n\t"
    "   addq   $4,%%rdi\n\t"
    "   addq   $4,%%rsi\n\t"
    "   subl   $1,%%edx\n\t"
    "   jnz    3b\n\t"
    "4:\n\t"
    : : "D" (px), "S" (py), "d" (elements) : "r8", "memory" );
}

My test, this time on an i5-3570:

Code:
$ ./test
Cycles:
 calcAsm: 38064
 calcC: 31628

calcAsm is still slower (20% slower!)...

My point: writing a routine in assembly does not automatically make it faster than what a good routine and a good compiler can produce... In the original test the comparison was between a non-optimized C function and a good asm routine (not fair!)... And, as @revolution points out, there are lots of other things to consider when measuring performance (and when creating an awesome routine that takes into account things such as cache usage, macro- and micro-fusion of instructions, etc., etc...).
Melissa 06 Sep 2019, 18:20
Code:
~/.../examples/test >>> g++ -Ofast -std=c++17 -march=native test.cpp -o testcpp -pthread -lstdc++
~/.../examples/test >>> ./testcpp
Cycles:
 calcAsm: 19872
 calcC: 13176
~/.../examples/test >>> g++ -O2 -std=c++17 -march=native test.cpp -o testcpp -pthread -lstdc++
~/.../examples/test >>> ./testcpp
Cycles:
 calcAsm: 20052
 calcC: 53496
~/.../examples/test >>> clang -O2 -std=c++17 -march=native test.cpp -o testcpp -pthread -lstdc++
~/.../examples/test >>> ./testcpp
Cycles:
 calcAsm: 14616
 calcC: 11808
~/.../examples/test >>> clang -Ofast -std=c++17 -march=native test.cpp -o testcpp -pthread -lstdc++
~/.../examples/test >>> ./testcpp
Cycles:
 calcAsm: 14220
 calcC: 11844

clang uses inline asm differently.
Furs 08 Sep 2019, 15:02
Here's what godbolt spit out:
Code:
calcC(int*, int*, unsigned long):
        mov     r9, rdx
        mov     rcx, rsi
        and     edx, 3
        mov     rax, rdi
        shr     r9, 2
        lea     r8, [r9-1]
        test    r9, r9
        je      .L5
.L2:
        movdqa  xmm0, XMMWORD PTR [rax]
        paddd   xmm0, XMMWORD PTR [rcx]
        sub     r8, 1
        add     rax, 16
        add     rcx, 16
        movaps  XMMWORD PTR [rax-16], xmm0
        cmp     r8, -1
        jne     .L2
        sal     r9, 4
        add     rdi, r9
        add     rsi, r9
.L5:
        test    rdx, rdx
        je      .L16
        mov     eax, DWORD PTR [rsi]
        add     DWORD PTR [rdi], eax
        cmp     rdx, 1
        je      .L1
        mov     eax, DWORD PTR [rsi+4]
        add     DWORD PTR [rdi+4], eax
        cmp     rdx, 2
        je      .L1
        mov     eax, DWORD PTR [rsi+8]
        add     DWORD PTR [rdi+8], eax
.L1:
        ret
.L16:
        ret