flat assembler
Message board for the users of flat assembler.
![]() Goto page Previous 1, 2 |
Author |
|
system error 06 Dec 2014, 16:13
gens wrote: run your loop 100 000 times then divide the run length by 100 000 Two thumbs up for your great technical advice. |
|||
![]() |
|
system error 06 Dec 2014, 16:16
AsmGuru62 wrote: code generation utility Busted! You are Oleh Yuschuk! When can we get that 64-bit Olly? |
|||
![]() |
|
system error 06 Dec 2014, 16:22
redsock wrote: From there, profiler$enter and profiler$leave just keep a stack-based call graph as it must so that I can do overall timings per function, or actual call graph timings, etc. redsock, you promised us your library for linux to be released this year. I hope this profiling macro is part of it. Just to remind you, we in December already ![]() |
|||
![]() |
|
AsmGuru62 06 Dec 2014, 21:11
![]() Oleh is from Germany, I think. I am in Canada. |
|||
![]() |
|
system error 07 Dec 2014, 00:20
Ok, it's tough for me. Completely new on this.
I tested this code to see how it performs and I linked it to gcc to see how C performs as well. Just a simple code to see how the FPU handles cotangent; Code: ;./fasm test.asm ;gcc -s test.o -o test format elf64 public main extrn printf main: push rbp mov rbp,rsp mov r15,100000 .begin: push return_val push [radian_val] call cotangent dec r15 jz .done jmp .begin .done: movq xmm0,[return_val] mov rdi,fmt call printf mov rsp,rbp pop rbp ret radian_val dq 3.4 return_val dq 0.0 fmt db "%.10f",0ah,0 ;+++++++++++++++++++++++++ cotangent: push rbp mov rbp,rsp finit fld qword[rbp+16] fptan fxch fld1 fdiv st0,st1 fstp qword[rbp+16] mov rax,qword[rbp+24] mov rcx,qword[rbp+16] mov qword[rax],rcx mov rsp,rbp pop rbp ret I ran it 100000 times, and here's the report from perf. Code: perf stat -B ./test 3.7833373376 Performance counter stats for './test': 97.371105 task-clock (msec) # 0.992 CPUs utilized 11 context-switches # 0.113 K/sec 0 cpu-migrations # 0.000 K/sec 531 page-faults # 0.005 M/sec 136,841,275 cycles # 1.405 GHz [83.61%] 101,180,410 stalled-cycles-frontend # 73.94% frontend cycles idle [83.63%] 14,139,295 stalled-cycles-backend # 10.33% backend cycles idle [32.90%] 3,494,856 instructions # 0.03 insns per cycle # 28.95 stalled cycles per insn [54.53%] 640,293 branches # 6.576 M/sec [69.56%] 2,724 branch-misses # 0.43% of all branches [83.34%] 0.098168684 seconds time elapsed One figure is highlighted in RED [73.94%] and I don't know why. So how to interpret this kind of information in regards to performance? |
|||
![]() |
|
system error 07 Dec 2014, 00:22
And here is a one shot result;
Code: Performance counter stats for './test': 0.942214 task-clock (msec) # 0.642 CPUs utilized 2 context-switches # 0.002 M/sec 0 cpu-migrations # 0.000 K/sec 140 page-faults # 0.149 M/sec 1,312,050 cycles # 1.393 GHz 484,149 stalled-cycles-frontend # 36.90% frontend cycles idle <not counted> stalled-cycles-backend <not counted> instructions <not counted> branches <not counted> branch-misses 0.001468048 seconds time elapsed |
|||
![]() |
|
system error 07 Dec 2014, 00:29
Is it Bad, bAD or BAD? Where and how can I improve it?
Edit: I am on AMD FX6100 six-core, Linux 64-bit. Part of the information; Code: ...blah processor : 0 vendor_id : AuthenticAMD cpu family : 21 model : 1 model name : AMD FX(tm)-6100 Six-Core Processor stepping : 2 microcode : 0x6000613 cpu MHz : 3322.171 cache size : 2048 KB ...blah |
|||
![]() |
|
system error 07 Dec 2014, 04:21
Out of topic;
Why is it that I have to use indirection when dealing with movq + xmm register? Code: ;./fasm test.asm ;gcc -s test.o -o test -lm format elf64 public main extrn printf extrn tan section '.text' executable align 16 main: push rbp mov rbp,rsp movq xmm0,[radian_val] ;argument 1 call tan ;call Tangent movq [return_val],xmm0 ;save return value ;movq xmm0,[return_val] ;why this fails? mov rax,return_val ;But these 2 work movq xmm0,[rax] mov rdi,fmt call printf mov rsp,rbp pop rbp ret section '.data' writeable radian_val dq 3.4 return_val dq 0.0 fmt db "%f",0ah,0 The commented line prints 0.00000, and I have to replace it with that 2 subsequent lines to make it work. |
|||
![]() |
|
cod3b453 07 Dec 2014, 13:11
Ideally, if you're trying to optimise code for throughput the stall and branch misses should be reduced as best as possible; the degree of success for this is based on how well your code fits into the physically available execution units/pipelines, which in reality is rarely anywhere near perfect.
The criteria for "badness" is up to you. Your example will be spending a lot of time stalled due to repeated calls to a very small function containing fptan and fdiv and their surrounding interdependencies. Now it may be that you're happy with the performance, the tan calculation is always going to take a relatively long time regardless. However, if the throughput is still not enough you'd know that you could pipeline the tan calculation loop using SSE or even pre-compute values in a LUT (either directly or with interpolation); note that both of these are more complex and much larger implementations, which may be a problem based on your criteria. |
|||
![]() |
|
gens 07 Dec 2014, 13:31
if you worry about performance, why not read a couple of docs about it
http://www.agner.org/optimize/#manuals http://support.amd.com/TechDocs/25112.PDF http://x264dev.multimedia.cx/archives/8 (plenty more on that blog) with proper debug symbols perf can give you per instruction latency there was some talk on how to do it with FASM for an example, run "perf top" (looks better if perf was compiled with newt) for measuring a loop you can just use the rdtsc instruction, as recommended there are plenty of examples on gugl on how to use it, some even proper in your code you are using tan(), a heavy libc call that is where most of the time executing is spent the agner link above includes an instruction latency and throughput table note how division takes a long time and floating point things have fairly high latencies oh ye, and echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor and to return it to normal echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor |
|||
![]() |
|
system error 07 Dec 2014, 13:49
cod3b453 wrote: Ideally, if you're trying to optimise code for throughput the stall and branch misses should be reduced as best as possible Ok thats a good start for me. I guess I need to dig up more on stalls and branch misses. Btw, is 'stall' any way related to the OS, like deadlock or semaphore or completely on the CPU alone? Because if it is, then I guess nothing much that I can do. Kernel really scares me. cod3b453 wrote: the degree of success for this is based on how well your code fits into the physically available execution units/pipelines, which in reality is rarely anywhere near perfect. Is it something that we can overcome by code or completely controlled by the CPU? |
|||
![]() |
|
system error 07 Dec 2014, 14:08
gens wrote: if you worry about performance, why not read a couple of docs about it Yeah I do read from those materials. But those are heavy. May take some time to digest them properly. Maybe I do need a PhD to understand these things :p |
|||
![]() |
|
cod3b453 07 Dec 2014, 14:30
Here stall refers to the hardware (b)locking until a condition is met. Most processors will pipeline instructions so even though instructions generally take more than one cycle it is able to execute previous/following instructions in parallel if there are not interdependencies.
Going back to your example, fptan will take hundreds of cycles but then is followed by fxch which depends on its output so it will be stalled until the result is present. It should be possible to insert other instruction between these two (assuming there was something else useful that can be done here) that would reduce the overall time of your code's execution. [Some processors even provide out-of-order execution capability which try to help with this but this can't necessarily be relied upon]. It's usually possible for two instructions to execute in parallel per core (FPU/SSE & ALU) though in some cases it can be more but this is dependent on the hardware implementation. |
|||
![]() |
|
system error 07 Dec 2014, 14:40
cod3b453 wrote: Here stall refers to the hardware (b)locking until a condition is met. Most processors will pipeline instructions so even though instructions generally take more than one cycle it is able to execute previous/following instructions in parallel if there are not interdependencies. Like 'fwait' or 'nop' between them or something that I can do in parallel like; Code: push rbp mov rbp,rsp finit fld qword[rbp+16] fptan mov rax,qword[rbp+24] mov rcx,qword[rbp+16] mov qword[rax],rcx fxch fld1 fdiv st0,st1 fstp qword[rbp+16] mov rsp,rbp pop rbp ret Like that? |
|||
![]() |
|
system error 07 Dec 2014, 15:21
Here's the result after reorganizing the code as codebase suggester:
Code: 3.7833373376 Performance counter stats for './test': 88.418802 task-clock (msec) # 0.993 CPUs utilized 3 context-switches # 0.034 K/sec 0 cpu-migrations # 0.000 K/sec 531 page-faults # 0.006 M/sec 124,282,392 cycles # 1.406 GHz [81.90%] 90,693,056 stalled-cycles-frontend # 72.97% frontend cycles idle [81.92%] 14,046,912 stalled-cycles-backend # 11.30% backend cycles idle [36.35%] 3,845,781 instructions # 0.03 insns per cycle # 23.58 stalled cycles per insn [54.55%] 774,747 branches # 8.762 M/sec [68.36%] 4,494 branch-misses # 0.58% of all branches [81.92%] 0.089021726 seconds time elapsed Stalled cycles front end is reduced by some huge margin ![]() New code layout; Code: ;+++++++++++++++++++++++++ cotangent: push rbp mov rbp,rsp finit fld qword[rbp+16] fptan mov rax,qword[rbp+24] mov qword[rax],rcx fxch fld1 fdiv st0,st1 fstp qword[rbp+16] mov rcx,qword[rbp+16] mov rsp,rbp pop rbp ret But the branch miss increases almost double? But I am starting to get the picture here ![]() |
|||
![]() |
|
system error 07 Dec 2014, 15:30
What 'branch' are we talking here? I don't see no 'branch' except testing r15 against 0.
|
|||
![]() |
|
gens 07 Dec 2014, 17:17
system error wrote: Yeah I do read from those materials. But those are heavy. May take some time to digest them properly. Maybe I do need a PhD to understand these things :p ofc it'l take time, as all new things need (like learning to read and write, for example) lots of sleep too here's some things to google: memory access of all kinds including the whole cache system on a modern cpu, memory alignment and such out-of-order execution and how branching requires a sync (hint: instruction latency) how a cpu does bitwise operations, addition/subtraction, multiplication, division and how they are combined in more complex things like sine and tangents floating point and so on Computer Architecture: A Quantitative Approach, i find to be a good book about cpu's and guglin around, ofc theres more, like this fun thing that you don't actually have to know about http://en.wikipedia.org/wiki/MESI_protocol in a while when those things sink in you will know why revolution was telling you that it's not that simple to do measurements with good precision |
|||
![]() |
|
redsock 08 Dec 2014, 04:29
system error wrote: redsock, you promised us your library for linux to be released this year. I hope this profiling macro is part of it. Just to remind you, we in December already ![]() ![]() |
|||
![]() |
|
cod3b453 08 Dec 2014, 19:42
system error wrote: ... Code: fld qword [rbp+16] ; something here fptan ; lots of stuff here fdivrp st1,st0 ; some stuff here fstp qword [rbp+16] |
|||
![]() |
|
Goto page Previous 1, 2 < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.