flat assembler
Message board for the users of flat assembler.
Jessé 31 May 2025, 18:09
Today, to celebrate what I achieved with fasm2/fasmg this month, I decided to revisit some old code of mine that can benchmark even single instructions on modern processors. That is very hard to do, because of clock modulation, task switching, and the timing jitter of a kernel that is not truly real-time. It works by looping over the same code millions of times, which pushes the processor into its "maximum mode", and then averaging the number of cycles per loop iteration:
Code:
```
format ELF64 executable 3 at 1#00000000h
entry Start

include 'anon_label.inc'
include 'fastcall_BETA2.inc'
include 'stdio.inc'

TEST_CYCLES = 400000000         ; Number of iterations within test loop

_data
    pAffinity   ddq 1
    testcnt     dq 0
    buffer1     rb 512
    buffer2     rb 512

_code
Start:      endbr64
            fputs("Setting priority for 1 core only... ", **stdout);
            fflush(**stdout);
            errno();
            mov     rbx, rax
            sched_setaffinity(NULL, sizeof(pAffinity), &pAffinity);
            test    eax, eax
            jz      @f
            perror("fail");
            jmp     @f2
    @@      perror(NULL);
    @@      mov     [rbx], dword 0
            setpriority(PRIO_PROCESS, NULL, PRIO_HIGHEST);
            mov     edx, [rbx]
            test    edx, edx
            jz      @f
            perror("Error elevating priority");
            jmp     @f2
    @@      perror("Setting priority");
    @@      puts(<27,"[0;35mStarting...",27,"[0m",0>);
            fflush(**stdout);
            mov     rbx, TEST_CYCLES
            mov     r12, rbx

; **********************************
; ***   Start Benchmark code     ***
            lfence
            rdtsc
            push    rdx
            push    rax
; ### Tested code goes here ###
    @@@     call    Dummy0
; ### End tested code ###
            dec     r12
            jnz     @@b
    @@      mfence
            rdtsc
            pop     rbp
            pop     rcx
; ***   End Benchmark code       ***
; **********************************

            sub     eax, ebp
            sbb     edx, ecx
            shrd    r15, rax, 32
            shrd    r15, rdx, 32
            mov     rax, r15
            cqo
            xor     esi, esi
            div     rbx
            shld    rsi, r15, 32
            printf(<"TSC run length: %08X:%08X.",10,\
                    "Iteration average: %u cycles.",10, \
                    "Total iteration: %lu loops.",10,0>, \
                    esi, r15d, eax, rbx);
            exit(0);

Dummy0:     push    rbp
            mov     rbp, rsp
            sub     rsp, 128
            leave
            ret

Dummy1:     enter   128, 0
            leave
            ret
```
You can learn a lot by testing things inside this, because it really shows how certain choices affect performance. As an example, I included 2 "Dummy" functions that each reserve 128 bytes for local variables: Dummy1 uses the 'enter/leave' pair, and Dummy0 uses the modern approach (push rbp / mov rbp, rsp / sub rsp, 128) that modern high-level compilers emit by default. You can swap them in the "Test Area" (between the "Tested code goes here" and "End tested code" comments), then assemble and run to test. I'll leave it to you guys to figure out the difference between them (it can indeed vary from processor to processor).
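For readers who find the sub/sbb/shrd/div sequence after the second rdtsc hard to follow, here is a rough C sketch of the same arithmetic. The function names `tsc_delta` and `average_cycles` are hypothetical and do not appear in the program above; the sketch only mirrors how the two 32-bit halves of each TSC reading are subtracted with a borrow, joined into one 64-bit value, and divided by the iteration count.

```c
#include <stdint.h>

/* Hypothetical helper (not in the original program): rebuild the 64-bit
   TSC difference from the 32-bit halves that rdtsc returns in edx:eax. */
uint64_t tsc_delta(uint32_t start_hi, uint32_t start_lo,
                   uint32_t end_hi, uint32_t end_lo)
{
    uint32_t lo = end_lo - start_lo;           /* sub eax, ebp            */
    uint32_t borrow = end_lo < start_lo;       /* carry flag left by sub  */
    uint32_t hi = end_hi - start_hi - borrow;  /* sbb edx, ecx            */
    return ((uint64_t)hi << 32) | lo;          /* the two shrd steps      */
}

/* Hypothetical helper: truncating division, as done by div rbx. */
uint64_t average_cycles(uint64_t delta, uint64_t iterations)
{
    return delta / iterations;
}
```

With TEST_CYCLES = 400000000, a delta of 1600000000 ticks would be reported as an average of 4 cycles per iteration; any remainder is dropped, just as the program prints only the quotient.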
The most amazing thing about this code (made with fasm2 and my macros, available in my fastcall post) is the size of the final executable: less than 2 kB! Compare that with the 10 kB of my previous nasm version, which uses the external linker 'ld' and the 'strip' command to get rid of some of the dirt that standard linkers put in binaries. This is one of the reasons I love fasm/fasm2, and it is the achievement I made the default in my include files. Because it is so small, I have attached the binary along with this source code. Keep in mind that 'dec r12/jnz @@b' adds +1 cycle to the average. It is not subtracted in the code, because I count this pair as part of the tested code.
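If you would rather see a figure without the loop overhead, one option is to run the benchmark a second time with the tested code removed and subtract the two totals. The C sketch below only illustrates that arithmetic. The function name `net_average` and the idea of a separate empty-loop run are assumptions, not something the attached program does.

```c
#include <stdint.h>

/* Hypothetical helper (not in the original program): average cycles per
   iteration after subtracting the total measured for an empty loop.
   Clamps to zero if noise makes the empty run look slower. */
uint64_t net_average(uint64_t total_with_code, uint64_t total_empty_loop,
                     uint64_t iterations)
{
    if (total_with_code <= total_empty_loop)
        return 0;
    return (total_with_code - total_empty_loop) / iterations;
}
```

For example, if a run with the tested code takes 2000000000 ticks and an empty-loop run takes 400000000 ticks over the same 400000000 iterations, the net average is 4 cycles.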
_________________ jesse6
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.