flat assembler
Message board for the users of flat assembler.

Index > Main > No code profiler for FASM people. We are so poor!

Goto page Previous  1, 2
Author
Thread Post new topic Reply to topic
redsock



Joined: 09 Oct 2009
Posts: 435
Location: Australia
redsock 05 Dec 2014, 19:46
rdtsc definitely works, my issues early on with my automatic profiler was that thanks to CPU power management, the actual timings can change throughout the course of a runtime.

The only way I could come up with that wasn't horrific to deal with was to create a prolog/epilog macro that I use inside each function definition, such that I can turn it on/off... excerpts from my library's profiler illustrate:

Code:
macro prolog name {
        local .function_name,.function_start
        if public_funcs
                public name
        end if
        if framepointers
                push rbp
                mov rbp, rsp
        end if
        if profiling
                jmp     .function_start
                cleartext .function_name, `name
                calign
                .function_start:
                ; important for obvious reasons that we preserve _all_ registers
                push    rax rcx rdx rdi rsi r8 r9 r10 r11
                mov     rdi, .function_name
                call    profiler$enter                  ; single arg rdi == function name rc4 offset into code
                pop     r11 r10 r9 r8 rsi rdi rdx rcx rax
        end if
}

macro epilog {
        if profiling
                ; important for obvious reasons that we preserve _all_ registers
                push    rax rcx rdx rdi rsi r8 r9 r10 r11
                call    profiler$leave
                pop     r11 r10 r9 r8 rsi rdi rdx rcx rax
        end if
        if framepointers
                leave
        end if
        ret
}    


Of course, per @revolution's comments, it certainly _does_ change things, but it does give you a very accurate view of where time is spent, how many calls and cycles, etc. and provides most importantly the ability to do optimization-based profiling (noting here that the extra stack/memory cache poisoning makes some of that impossible).

At program start, I have to kick the power management a bit, so I use this code to come up with the delta timing to "do my best" to eliminate the rdtsc from the resultant output:

Code:
globals
{
        _profiler_tscaccum      dq      0
        _profiler_deltamod      dq      0
        _profiler_last_tsc      dq      0
        _profiler_current       dq      0
        _profiler_block_free    dq      0
        _profiler_block         dq      0
}

falign
public profiler$init
profiler$init:
        ; burning purpose: compute our _profiler_deltamod


        ; NOTE: we need to sit and spin for 20us or so to kick the cpu power management up before we start our timing

        ; ALSO: without this pmloop init here, what can happen is that power management upscales our process, and thus
        ; what we start out as for timing information turns out to be incorrect, and we can get negative CPC results
        ; in the profiler output.

        xor     eax, eax
        mov     ecx, 320000000
calign
.pmloop:
        xor     eax, ecx
        add     eax, ecx
        sub     ecx, 1
        jnz     .pmloop

        readtsc
        mov     [_profiler_last_tsc], rax
        mov     ecx, 1000000
calign
.loop:
        readtsc
        sub     ecx, 1
        jnz     .loop
        ; rax has our most recent result
        sub     rax, qword [_profiler_last_tsc]
        ; rax now has how long it took to do a bunch of readtscs....
        mov     ecx, 1000000
        xor     edx, edx
        div     rcx
        mov     [_profiler_deltamod], rax
        ret    


From there, profiler$enter and profiler$leave just keep a stack-based call graph as it must so that I can do overall timings per function, or actual call graph timings, etc.

Can post the rest of profiler$enter and profiler$leave if any of you are interested in the gory details..

it does work nicely though Smile
Post 05 Dec 2014, 19:46
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 06 Dec 2014, 16:13
gens wrote:
run your loop 100 000 times then divide the run length by 100 000

sampling profilers work by reading the instruction pointer every once in a while, amongst other things
so the loop HAS to run for a while if you want usable results


tools like perf, codexl and such can show you where the program spends it's time, the whole program
and only if it runs for a longer time
for a loop you would be better to measure it yourself
you have to know about things like alignment, instruction and data caches and so on or you can get misleading results

oh ye, and put your cpu to performance mode
forgot to say, also symbols under linux


Two thumbs up for your great technical advice.
Post 06 Dec 2014, 16:13
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 06 Dec 2014, 16:16
AsmGuru62 wrote:
code generation utility


Busted! You are Oleh Yuschuk!
When can we get that 64-bit Olly?
Post 06 Dec 2014, 16:16
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 06 Dec 2014, 16:22
redsock wrote:
From there, profiler$enter and profiler$leave just keep a stack-based call graph as it must so that I can do overall timings per function, or actual call graph timings, etc.

Can post the rest of profiler$enter and profiler$leave if any of you are interested in the gory details..

it does work nicely though Smile


redsock, you promised us your library for linux to be released this year. I hope this profiling macro is part of it. Just to remind you, we in December already Razz
Post 06 Dec 2014, 16:22
View user's profile Send private message Reply with quote
AsmGuru62



Joined: 28 Jan 2004
Posts: 1694
Location: Toronto, Canada
AsmGuru62 06 Dec 2014, 21:11
Smile
Oleh is from Germany, I think.
I am in Canada.
Post 06 Dec 2014, 21:11
View user's profile Send private message Send e-mail Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 00:20
Ok, it's tough for me. Completely new on this.

I tested this code to see how it performs and I linked it to gcc to see how C performs as well. Just a simple code to see how the FPU handles cotangent;

Code:
;./fasm test.asm
;gcc -s test.o -o test
format elf64
public main
extrn printf

main:
        push    rbp
        mov     rbp,rsp
        mov     r15,100000
.begin:
        push    return_val
        push    [radian_val]
        call    cotangent
        dec     r15
        jz      .done
        jmp     .begin
.done:
        movq    xmm0,[return_val]
        mov     rdi,fmt
        call    printf
        mov     rsp,rbp
        pop     rbp
        ret

radian_val  dq 3.4
return_val  dq 0.0
fmt  db "%.10f",0ah,0

;+++++++++++++++++++++++++
cotangent:
        push    rbp
        mov     rbp,rsp
        finit
        fld     qword[rbp+16]
        fptan
        fxch
        fld1
        fdiv    st0,st1
        fstp    qword[rbp+16]
        mov     rax,qword[rbp+24]
        mov     rcx,qword[rbp+16]
        mov     qword[rax],rcx
        mov     rsp,rbp
        pop     rbp
        ret    


I ran it 100000 times, and here's the report from perf.

Code:
perf stat -B ./test 
3.7833373376

 Performance counter stats for './test':

         97.371105 task-clock (msec)         #    0.992 CPUs utilized          
                11 context-switches          #    0.113 K/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
               531 page-faults               #    0.005 M/sec                  
       136,841,275 cycles                    #    1.405 GHz                     [83.61%]
       101,180,410 stalled-cycles-frontend   #   73.94% frontend cycles idle    [83.63%]
        14,139,295 stalled-cycles-backend    #   10.33% backend  cycles idle    [32.90%]
         3,494,856 instructions              #    0.03  insns per cycle        
                                             #   28.95  stalled cycles per insn [54.53%]
           640,293 branches                  #    6.576 M/sec                   [69.56%]
             2,724 branch-misses             #    0.43% of all branches         [83.34%]

       0.098168684 seconds time elapsed    


One figure is highlighted in RED [73.94%] and I don't know why.

So how to interpret this kind of information in regards to performance?
Post 07 Dec 2014, 00:20
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 00:22
And here is a one shot result;
Code:
 Performance counter stats for './test':

          0.942214 task-clock (msec)         #    0.642 CPUs utilized          
                 2 context-switches          #    0.002 M/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
               140 page-faults               #    0.149 M/sec                  
         1,312,050 cycles                    #    1.393 GHz                    
           484,149 stalled-cycles-frontend   #   36.90% frontend cycles idle   
     <not counted> stalled-cycles-backend  
     <not counted> instructions            
     <not counted> branches                
     <not counted> branch-misses           

       0.001468048 seconds time elapsed    
Post 07 Dec 2014, 00:22
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 00:29
Is it Bad, bAD or BAD? Where and how can I improve it?

Edit: I am on AMD FX6100 six-core, Linux 64-bit. Part of the information;

Code:
...blah
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 1
model name      : AMD FX(tm)-6100 Six-Core Processor
stepping        : 2
microcode       : 0x6000613
cpu MHz         : 3322.171
cache size      : 2048 KB
...blah    
Post 07 Dec 2014, 00:29
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 04:21
Out of topic;

Why is it that I have to use indirection when dealing with movq + xmm register?

Code:
;./fasm test.asm
;gcc -s test.o -o test -lm
format elf64

public  main
extrn   printf
extrn   tan

section '.text' executable align 16
main:
        push    rbp
        mov     rbp,rsp

        movq    xmm0,[radian_val]       ;argument 1
        call    tan                     ;call Tangent
        movq    [return_val],xmm0       ;save return value
        
        ;movq   xmm0,[return_val]       ;why this fails?
        mov     rax,return_val          ;But these 2 work       
        movq    xmm0,[rax]        
        mov     rdi,fmt         
        call    printf

        mov     rsp,rbp
        pop     rbp
        ret

section '.data' writeable
radian_val dq 3.4
return_val dq 0.0
fmt db "%f",0ah,0    


The commented line prints 0.00000, and I have to replace it with that 2 subsequent lines to make it work.
Post 07 Dec 2014, 04:21
View user's profile Send private message Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 618
cod3b453 07 Dec 2014, 13:11
Ideally, if you're trying to optimise code for throughput the stall and branch misses should be reduced as best as possible; the degree of success for this is based on how well your code fits into the physically available execution units/pipelines, which in reality is rarely anywhere near perfect.

The criteria for "badness" is up to you. Your example will be spending a lot of time stalled due to repeated calls to a very small function containing fptan and fdiv and their surrounding interdependencies. Now it may be that you're happy with the performance, the tan calculation is always going to take a relatively long time regardless. However, if the throughput is still not enough you'd know that you could pipeline the tan calculation loop using SSE or even pre-compute values in a LUT (either directly or with interpolation); note that both of these are more complex and much larger implementations, which may be a problem based on your criteria.
Post 07 Dec 2014, 13:11
View user's profile Send private message Reply with quote
gens



Joined: 18 Feb 2013
Posts: 161
gens 07 Dec 2014, 13:31
if you worry about performance, why not read a couple of docs about it
http://www.agner.org/optimize/#manuals
http://support.amd.com/TechDocs/25112.PDF
http://x264dev.multimedia.cx/archives/8 (plenty more on that blog)

with proper debug symbols perf can give you per instruction latency
there was some talk on how to do it with FASM
for an example, run "perf top" (looks better if perf was compiled with newt)

for measuring a loop you can just use the rdtsc instruction, as recommended
there are plenty of examples on gugl on how to use it, some even proper

in your code you are using tan(), a heavy libc call
that is where most of the time executing is spent

the agner link above includes an instruction latency and throughput table
note how division takes a long time and floating point things have fairly high latencies

oh ye, and
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
and to return it to normal
echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Post 07 Dec 2014, 13:31
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 13:49
cod3b453 wrote:
Ideally, if you're trying to optimise code for throughput the stall and branch misses should be reduced as best as possible


Ok thats a good start for me. I guess I need to dig up more on stalls and branch misses. Btw, is 'stall' any way related to the OS, like deadlock or semaphore or completely on the CPU alone? Because if it is, then I guess nothing much that I can do. Kernel really scares me.

cod3b453 wrote:
the degree of success for this is based on how well your code fits into the physically available execution units/pipelines, which in reality is rarely anywhere near perfect.


Is it something that we can overcome by code or completely controlled by the CPU?
Post 07 Dec 2014, 13:49
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 14:08
gens wrote:
if you worry about performance, why not read a couple of docs about it
http://www.agner.org/optimize/#manuals
http://support.amd.com/TechDocs/25112.PDF
http://x264dev.multimedia.cx/archives/8 (plenty more on that blog)

with proper debug symbols perf can give you per instruction latency
there was some talk on how to do it with FASM
for an example, run "perf top" (looks better if perf was compiled with newt)

for measuring a loop you can just use the rdtsc instruction, as recommended
there are plenty of examples on gugl on how to use it, some even proper

in your code you are using tan(), a heavy libc call
that is where most of the time executing is spent

the agner link above includes an instruction latency and throughput table
note how division takes a long time and floating point things have fairly high latencies

oh ye, and
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
and to return it to normal
echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Thanks gens. You been a great help for me.

Yeah I do read from those materials. But those are heavy. May take some time to digest them properly. Maybe I do need a PhD to understand these things :p
Post 07 Dec 2014, 14:08
View user's profile Send private message Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 618
cod3b453 07 Dec 2014, 14:30
Here stall refers to the hardware (b)locking until a condition is met. Most processors will pipeline instructions so even though instructions generally take more than one cycle it is able to execute previous/following instructions in parallel if there are not interdependencies.

Going back to your example, fptan will take hundreds of cycles but then is followed by fxch which depends on its output so it will be stalled until the result is present. It should be possible to insert other instruction between these two (assuming there was something else useful that can be done here) that would reduce the overall time of your code's execution. [Some processors even provide out-of-order execution capability which try to help with this but this can't necessarily be relied upon].

It's usually possible for two instructions to execute in parallel per core (FPU/SSE & ALU) though in some cases it can be more but this is dependent on the hardware implementation.
Post 07 Dec 2014, 14:30
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 14:40
cod3b453 wrote:
Here stall refers to the hardware (b)locking until a condition is met. Most processors will pipeline instructions so even though instructions generally take more than one cycle it is able to execute previous/following instructions in parallel if there are not interdependencies.

Going back to your example, fptan will take hundreds of cycles but then is followed by fxch which depends on its output so it will be stalled until the result is present. It should be possible to insert other instruction between these two (assuming there was something else useful that can be done here) that would reduce the overall time of your code's execution. [Some processors even provide out-of-order execution capability which try to help with this but this can't necessarily be relied upon].
.


Like 'fwait' or 'nop' between them or something that I can do in parallel like;

Code:
push    rbp
        mov     rbp,rsp
        finit
        fld     qword[rbp+16]
        fptan
        mov     rax,qword[rbp+24]
        mov     rcx,qword[rbp+16]
        mov     qword[rax],rcx
        fxch
        fld1
        fdiv    st0,st1
        fstp    qword[rbp+16]
        mov     rsp,rbp
        pop     rbp
        ret
    


Like that?
Post 07 Dec 2014, 14:40
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 15:21
Here's the result after reorganizing the code as codebase suggester:

Code:
3.7833373376

 Performance counter stats for './test':

         88.418802 task-clock (msec)         #    0.993 CPUs utilized          
                 3 context-switches          #    0.034 K/sec                  
                 0 cpu-migrations            #    0.000 K/sec                  
               531 page-faults               #    0.006 M/sec                  
       124,282,392 cycles                    #    1.406 GHz                     [81.90%]
        90,693,056 stalled-cycles-frontend   #   72.97% frontend cycles idle    [81.92%]
        14,046,912 stalled-cycles-backend    #   11.30% backend  cycles idle    [36.35%]
         3,845,781 instructions              #    0.03  insns per cycle        
                                             #   23.58  stalled cycles per insn [54.55%]
           774,747 branches                  #    8.762 M/sec                   [68.36%]
             4,494 branch-misses             #    0.58% of all branches         [81.92%]

       0.089021726 seconds time elapsed    


Stalled cycles front end is reduced by some huge margin Very Happy

New code layout;
Code:
;+++++++++++++++++++++++++
cotangent:
        push    rbp
        mov     rbp,rsp
        finit
        fld     qword[rbp+16]
        fptan
        mov     rax,qword[rbp+24]
        mov     qword[rax],rcx
        fxch
        fld1
        fdiv    st0,st1
        fstp    qword[rbp+16]
        mov     rcx,qword[rbp+16]
        mov     rsp,rbp
        pop     rbp
        ret    


But the branch miss increases almost double? But I am starting to get the picture here Very Happy
Post 07 Dec 2014, 15:21
View user's profile Send private message Reply with quote
system error



Joined: 01 Sep 2013
Posts: 670
system error 07 Dec 2014, 15:30
What 'branch' are we talking here? I don't see no 'branch' except testing r15 against 0.
Post 07 Dec 2014, 15:30
View user's profile Send private message Reply with quote
gens



Joined: 18 Feb 2013
Posts: 161
gens 07 Dec 2014, 17:17
system error wrote:
Yeah I do read from those materials. But those are heavy. May take some time to digest them properly. Maybe I do need a PhD to understand these things :p


ofc it'l take time, as all new things need (like learning to read and write, for example)
lots of sleep too

here's some things to google:
memory access of all kinds
including the whole cache system on a modern cpu, memory alignment and such
out-of-order execution and how branching requires a sync (hint: instruction latency)
how a cpu does bitwise operations, addition/subtraction, multiplication, division and how they are combined in more complex things like sine and tangents
floating point
and so on

Computer Architecture: A Quantitative Approach, i find to be a good book about cpu's
and guglin around, ofc

theres more, like this fun thing that you don't actually have to know about
http://en.wikipedia.org/wiki/MESI_protocol


in a while when those things sink in you will know why revolution was telling you that it's not that simple to do measurements with good precision
Post 07 Dec 2014, 17:17
View user's profile Send private message Reply with quote
redsock



Joined: 09 Oct 2009
Posts: 435
Location: Australia
redsock 08 Dec 2014, 04:29
system error wrote:
redsock, you promised us your library for linux to be released this year. I hope this profiling macro is part of it. Just to remind you, we in December already Razz
Smile As it happens, my doing HTML5/CSS/JS is _considerably_ slower than my assembly language programming, haha... I stopped coding and have been exclusively doing nothing but presentation materials for it... Assuming I don't run into any more ridiculous roadblocks, am hoping to release it all next week, wish me luck Smile
Post 08 Dec 2014, 04:29
View user's profile Send private message Reply with quote
cod3b453



Joined: 25 Aug 2004
Posts: 618
cod3b453 08 Dec 2014, 19:42
system error wrote:
...
Like 'fwait' or 'nop' between them or something that I can do in parallel like;

Code:
push    rbp
        mov     rbp,rsp
        finit
        fld     qword[rbp+16]
        fptan
        mov     rax,qword[rbp+24]
        mov     rcx,qword[rbp+16]
        mov     qword[rax],rcx
        fxch
        fld1
        fdiv    st0,st1
        fstp    qword[rbp+16]
        mov     rsp,rbp
        pop     rbp
        ret
    


Like that?
Yes something along those lines though all you need is:
Code:
fld qword [rbp+16]
; something here
fptan
; lots of stuff here
fdivrp st1,st0
; some stuff here
fstp qword [rbp+16]    
Post 08 Dec 2014, 19:42
View user's profile Send private message Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.