[fasm2] Silly benchmark idea that actually works

Today, to celebrate my achievements with fasm2/fasmg this month, I decided to revisit some old code of mine that can benchmark even single instructions on modern processors (a very hard thing to do, because of clock modulation, task switching, and the timing jitter of a kernel that is not truly real-time). It does this by running the same code millions of times in a loop, which pushes the processor into its "maximum mode", and then computing the average number of cycles per iteration:

Code:

format ELF64 executable 3 at 1#00000000h
entry Start

include 'anon_label.inc'
include 'fastcall_BETA2.inc'
include 'stdio.inc'

TEST_CYCLES = 400000000       ; Number of iterations within test loop

_data
          pAffinity                ddq 1
          testcnt                  dq 0

          buffer1                  rb 512
          buffer2                  rb 512

_code
          Start:                   endbr64
                                   fputs("Setting priority for 1 core only... ", **stdout);
                                   fflush(**stdout);
                                   errno();
                                   mov       rbx, rax
                                   sched_setaffinity(NULL, sizeof(pAffinity), &pAffinity);
                                   test      eax, eax
                                   jz        @f
                                   perror("fail");
                                   jmp       @f2
                              @@   perror(NULL);
                              @@   mov       [rbx], dword 0
                                   setpriority(PRIO_PROCESS, NULL, PRIO_HIGHEST);
                                   mov       edx, [rbx]
                                   test      edx, edx
                                   jz        @f
                                   perror("Error elevating priority");
                                   jmp       @f2
                              @@   perror("Setting priority");
                              @@   puts(<27,"[0;35mStarting...",27,"[0m",0>);
                                   fflush(**stdout);
                                   mov       rbx, TEST_CYCLES
                                   mov       r12, rbx

                                   ; **********************************
                                   ; ***    Start Benchmark code    ***

                                   lfence
                                   rdtsc
                                   push      rdx
                                   push      rax

                                   ; ### Tested code goes here ###
                              @@@  call      Dummy0
                                   ; ### End tested code       ###

                                   dec       r12
                                   jnz       @@b

                              @@   mfence
                                   rdtsc
                                   pop       rbp
                                   pop       rcx

                                   ; ***     End Benchmark code     ***
                                   ; **********************************

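                                   sub       eax, ebp            ; low dword: end - start
                                   sbb       edx, ecx            ; high dword, minus the borrow
                                   shrd      r15, rax, 32        ; shift the low dword into r15...
                                   shrd      r15, rdx, 32        ; ...then the high dword: r15 = 64-bit TSC delta
                                   mov       rax, r15
                                   xor       edx, edx            ; 'div' is unsigned, so zero RDX rather than sign-extend with 'cqo'
                                   xor       esi, esi
                                   div       rbx                 ; rax = average cycles per iteration
                                   shld      rsi, r15, 32        ; esi = high dword of the delta, for printing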
                                   printf(<"TSC run length: %08X:%08X.",10,\
                                        "Iteration average: %u cycles.",10, \
                                        "Total iteration: %lu loops.",10,0>, \
                                        esi, r15d, eax, rbx);
                                   exit(0);

          Dummy0:                  push      rbp
                                   mov       rbp, rsp
                                   sub       rsp, 128
                                   leave
                                   ret

          Dummy1:                  enter     128, 0
                                   leave
                                   ret

You can learn a lot by testing things inside this loop, because it really shows how certain choices impact performance.
As an example, I included 2 "Dummy" functions that each reserve 128 bytes of "local variables": Dummy1 uses the 'enter'/'leave' pair, and Dummy0 uses the 'push rbp' / 'mov rbp, rsp' / 'sub rsp' sequence that modern high-level compilers emit by default. You can swap them in the "Tested code" area, then compile and run to compare. I'll leave it to you to figure out the difference between them (it can indeed vary from processor to processor).

The most amazing thing about this code (made with fasm2 and my macros, available in my fastcall post) is the size of the final executable: less than 2 kB, versus 10 kB for my previous nasm version, which needed the external linker 'ld' and the 'strip' command to get rid of some of the clutter that standard linkers put in binaries.
This is one of the reasons I love fasm/fasm2, and I have made this the default in my include files. Given the small size, I have attached both the binary and this source code.

Keep in mind that the 'dec r12' / 'jnz @@b' pair adds 1 cycle to the average. It is not subtracted in the code, because I count this pair as part of the tested code.
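
For anyone who wants to check the arithmetic at the end of the listing without reading the shrd/div sequence, here is a rough sketch of the same calculation in C. It is not part of my program: the names tsc_join and avg_cycles and the loop_overhead parameter are made up for illustration, and the sketch assumes you already have the two RDTSC readings as EDX:EAX halves. Passing loop_overhead = 0 matches what the listing does (the 'dec'/'jnz' pair stays in the result); passing 1 would take it out, if you trust the 1 cycle figure on your processor.

```c
#include <assert.h>
#include <stdint.h>

/* Join the two 32-bit halves that RDTSC returns (EDX:EAX) into one 64-bit value. */
static uint64_t tsc_join(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Average cycles per iteration, using an unsigned division like 'div rbx'.
   loop_overhead is a hypothetical knob, not something the listing has:
   0 keeps the 'dec/jnz' cost in the result, 1 would subtract it. */
static uint64_t avg_cycles(uint64_t start, uint64_t end,
                           uint64_t iterations, uint64_t loop_overhead)
{
    assert(iterations != 0);                 /* avoid dividing by zero */
    uint64_t avg = (end - start) / iterations;
    return avg > loop_overhead ? avg - loop_overhead : 0;
}
```

With TEST_CYCLES = 400000000 and two readings that are 2000000000 ticks apart, avg_cycles(start, end, 400000000, 0) gives 5, which is the number the "Iteration average" line would print.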