flat assembler
Message board for the users of flat assembler.

Performance difference Windows and Linux with FASM

coen
Joined: 13 Oct 2015
Posts: 14
I'm experimenting a little with FASM under Windows 2012 R2 and CentOS 7.1 (both 64-bit). For some reason my program runs much slower on CentOS than on Windows, and I don't understand why.

Windows code:
Code:
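format PE64 console
entry start

include 'win64ax.inc'

section '.text' code readable executable

start:

mov rcx, 30                      ; outer loop counter
L1:
    mov rsi, 673223862           ; inner loop counter
    mov r9, 0                    ; r9 counts inner iterations upward

    vxorpd ymm0, ymm0, ymm0      ; clear ymm0 (overwritten by the broadcast below)
    vpbroadcastd ymm0, [one]     ; ymm0 = eight dwords, each holding 1 (AVX2)

    vxorpd ymm1, ymm1, ymm1      ; ymm1 = accumulator, cleared

    L2:
        vpaddd ymm1, ymm0, ymm1  ; add 1 to each of the eight dword lanes

        add r9, 1
        sub rsi, 1
    jnz L2                       ; repeat until rsi reaches zero

    sub rcx, 1
jnz L1                           ; run the inner loop 30 times

invoke  ExitProcess, 00


section '.data' data readable writeable

one dd 1                         ; the only memory operand, read once per outer pass


section '.idata' import data readable writeable

library kernel32, 'kernel32.dll'

include 'api\Kernel32.inc'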
    


Linux code:
Code:
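format ELF64 executable at 0000000100000000h

segment readable executable

entry $

mov rcx, 30                      ; outer loop counter
L1:
    mov rsi, 673223862           ; inner loop counter
    mov r9, 0                    ; r9 counts inner iterations upward

    vxorpd ymm0, ymm0, ymm0      ; clear ymm0 (overwritten by the broadcast below)
    vpbroadcastd ymm0, [one]     ; ymm0 = eight dwords, each holding 1 (AVX2)

    vxorpd ymm1, ymm1, ymm1      ; ymm1 = accumulator, cleared

    L2:
        vpaddd ymm1, ymm0, ymm1  ; add 1 to each of the eight dword lanes

        add r9, 1
        sub rsi, 1
    jnz L2                       ; repeat until rsi reaches zero

    sub rcx, 1
jnz L1                           ; run the inner loop 30 times

xor     edi,edi                  ; exit status 0
mov     eax,60 ; sys_exit
syscall


segment readable writeable

one dd 1                         ; the only memory operand, read once per outer pass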
    


When I run these on my test server (dual boot), the Windows version takes 10.5 seconds to complete and the CentOS version takes 12.6 seconds.

Does anyone know a possible reason for this performance difference?
Post 13 Oct 2015, 15:06
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
I suspect that it is because of the unaligned variable "one". The ELF formats do not automatically align things as the PE format does.

Try inserting "align 16" or similar before the declaration of "one".
Post 13 Oct 2015, 15:15
typedef
Joined: 25 Jul 2010
Posts: 2913
Location: 0x77760000
Check difference in CPU, memory, background activities etc.
Post 13 Oct 2015, 15:15
coen
Joined: 13 Oct 2015
Posts: 14
revolution wrote:
I suspect that it is because of the unaligned variable "one". The ELF formats do not automatically align things as the PE format does.

Try inserting "align 16" or similar before the declaration of "one".


I didn't know that FASM aligns automatically in PE format and not in ELF format, so thanks for that piece of useful information :). However, I tried it and it didn't make a difference in this case. The 'one' variable is used only 30 times when running this code.

typedef wrote:
Check difference in CPU, memory, background activities etc.


The server is a test server and doesn't have any other background processes running that could cause the difference in performance. Also I ran the test several times and posted average times here (I should have mentioned that in my opening post).
Post 13 Oct 2015, 15:45
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
coen wrote:
I didn't know that FASM aligns automatically in PE format and not in ELF format, so thanks for that piece of useful information :)
It is a consequence of the format. PE format forces alignment of new sections, fasm has no choice. ELF does not force alignment, so the programmer has to say when they want it.
Post 13 Oct 2015, 23:16
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
As for your problem, there could be many reasons why you see a difference. Perhaps Linux is clocking the CPU differently because of a light task load or some power management setting.
Post 13 Oct 2015, 23:20
coen
Joined: 13 Oct 2015
Posts: 14
revolution wrote:
As for your problem, there could be many reasons why you see a difference. Perhaps Linux is clocking the CPU differently because of a light task load or some power management setting.


I've double-checked this but the CPU is an Intel Xeon and is always running at 2.3 GHz, so no throttling there, but thanks for the suggestion.

I'm gonna try to run some CPU benchmarks later using Geekbench to see whether it also shows a performance difference based on the OS. This might help me determine whether the difference is caused by my source code or not.
Post 14 Oct 2015, 07:16
fasmnewbie
Joined: 01 Mar 2011
Posts: 553
coen

For a 2-second difference (which is quite a lot), is there a chance that your code is actually involved in memory swapping or some other memory-related issue on Linux? AFAIK, nothing in the north bridge should produce a 2-second difference. It must have come from the south (e.g. hard disk).
Post 14 Oct 2015, 11:39
coen
Joined: 13 Oct 2015
Posts: 14
@fasmnewbie, I'm running the code as shown above, so there's no memory or IO access going on at all, just CPU registers. I also think this big performance difference shouldn't be possible for just CPU instructions, but yet it is :(
Post 14 Oct 2015, 14:12
randall
Joined: 03 Dec 2011
Posts: 155
Location: Poland
You can use likwid (https://github.com/RRZE-HPC/likwid) to find out what is going on on Linux.
Post 14 Oct 2015, 14:31
JohnFound
Joined: 16 Jun 2003
Posts: 3500
Location: Bulgaria
Try to play a little with the CPU frequency governor. Put it to "performance" instead of "on demand" and test again.
Code:
sudo cpupower frequency-set -g performance    
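
To confirm that the governor change actually took effect on every core, the cpufreq values can be read back from sysfs. The following is only a minimal sketch: `read_cpufreq` is a hypothetical helper name, and it assumes the kernel exposes the usual `scaling_governor` and `scaling_cur_freq` files under `/sys/devices/system/cpu` (on systems without cpufreq support it simply reports nothing).

```python
from pathlib import Path

# Standard Linux sysfs location; assumed to exist on the machine under test.
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_cpufreq(root: Path = CPU_ROOT) -> dict:
    """Map each CPU name to (governor, current kHz); empty if cpufreq is absent."""
    result = {}
    for cpufreq in sorted(root.glob("cpu[0-9]*/cpufreq")):
        try:
            governor = (cpufreq / "scaling_governor").read_text().strip()
            khz = int((cpufreq / "scaling_cur_freq").read_text())
        except (OSError, ValueError):
            continue  # file missing or unreadable on this kernel; skip the core
        result[cpufreq.parent.name] = (governor, khz)
    return result


for cpu, (governor, khz) in read_cpufreq().items():
    print(f"{cpu}: governor={governor}, current={khz / 1e6:.2f} GHz")
```

Running this while the test program is busy shows whether the core it runs on really sits at the expected clock.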
Post 14 Oct 2015, 14:52
coen
Joined: 13 Oct 2015
Posts: 14
@JohnFound, thanks for your input. I've tried it, but sadly there's no difference in the results.
Post 14 Oct 2015, 15:06
coen
Joined: 13 Oct 2015
Posts: 14
Well... I've run some Geekbench benchmarks to see if it also shows a difference in performance and it does.

Windows result: 3106
CentOS result: 2536

A difference of approximately 20%, the same as my test code. I still have no idea how this is possible, but at least I know the problem is not in my code or FASM :)
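
As a quick check of the "approximately 20%" figure, the two gaps can be computed from the numbers posted in this thread. This is just arithmetic on those values; `percent_more` is a hypothetical helper name.

```python
def percent_more(a: float, b: float) -> float:
    """How much larger a is than b, in percent."""
    return (a / b - 1) * 100


# Run times of the test program (seconds) and Geekbench scores from the thread.
print(f"CentOS run time vs Windows: +{percent_more(12.6, 10.5):.1f}%")
print(f"Windows Geekbench score vs CentOS: +{percent_more(3106, 2536):.1f}%")
```

The run time gap comes out at 20% and the benchmark gap at roughly 22%, so the two measurements are in the same range.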
Post 14 Oct 2015, 15:20
l_inc
Joined: 23 Oct 2009
Posts: 881
coen
This might be related to the lazy context switching policy. An OS normally tends to avoid saving the full context of an old thread unless newer threads need to use the unsaved part. Depending on how many other threads in the system use AVX the performance might increase due to avoided saving overhead or degrade due to excessive exception rates required to lazily save the context. AFAIR Windows engages lazy switching for the first use of the Coprocessor/MMX/SSE/AVX and then uses non-lazy context saving when switching happens between threads once spotted to have used the corresponding extension.

Maybe CentOS does something different. Or maybe CentOS is just slow on context switching in general. This can be easily checked by comparing runtimes of simple empty loops.
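
One way to see how much context switching a run actually suffers on Linux is to read the child's resource usage after it exits. This is a sketch under assumptions: `child_usage` is a hypothetical helper, and it relies on `resource.getrusage(RUSAGE_CHILDREN)`, which is cumulative, so the values are taken as a before/after difference.

```python
import resource
import subprocess


def child_usage(cmd: list) -> dict:
    """Run cmd to completion and return its CPU time and context switch counts."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "user_s": after.ru_utime - before.ru_utime,
        "sys_s": after.ru_stime - before.ru_stime,
        "voluntary_switches": after.ru_nvcsw - before.ru_nvcsw,
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
    }


# Example (path is an assumption): child_usage(["./test"])
```

A large `involuntary_switches` count or a noticeable `sys_s` value for a loop that makes no system calls would point toward the scheduler or exception handling rather than the instructions themselves.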

_________________
Faith is a superposition of knowledge and fallacy
Post 14 Oct 2015, 15:35
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 17279
Location: In your JS exploiting you and your system
If it is related to context switching and/or exceptions/interrupts then you can try setting the affinity to core 2 or 3 or something. Most OSes use core 0 for the house-keeping tasks and the other cores won't be affected.
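
On Linux the affinity experiment can be scripted roughly as follows. This is a minimal sketch, not a tuned benchmark harness: `timed_run_pinned` is a hypothetical helper, it uses `os.sched_setaffinity` (Linux only), and it relies on the child process inheriting the parent's affinity mask.

```python
import os
import subprocess
import time


def timed_run_pinned(cmd: list, core: int) -> float:
    """Run cmd restricted to one core and return the wall-clock time in seconds."""
    original = os.sched_getaffinity(0)   # remember the current mask
    os.sched_setaffinity(0, {core})      # the child inherits this mask
    try:
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        return time.perf_counter() - start
    finally:
        os.sched_setaffinity(0, original)  # restore the parent's mask


# Example (path and core number are assumptions): timed_run_pinned(["./test"], 2)
```

Comparing the time on core 0 with the time on another core gives a rough idea of whether housekeeping work on one core matters.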
Post 14 Oct 2015, 16:03
l_inc
Joined: 23 Oct 2009
Posts: 881
revolution
Quote:
If it is related to context switching and/or exceptions/interrupts then you can try setting the affinity to core 2 or 3 or something.

And highest priority. Not sure what you mean by "house keeping", but threads are normally assigned to cores in some form of round robin. And interrupt delivery is also often evenly distributed among cores.

_________________
Faith is a superposition of knowledge and fallacy
Post 14 Oct 2015, 16:13
coen
Joined: 13 Oct 2015
Posts: 14
I tried running it with a higher priority using the 'nice' command, but that didn't make a difference either.

I don't know what to try anymore at this point, as a last resort I'm gonna reinstall CentOS and hope that the problem will be gone with a fresh install.
Post 14 Oct 2015, 20:10
ACP
Joined: 23 Sep 2006
Posts: 204
coen wrote:

I don't know what to try anymore at this point, as a last resort I'm gonna reinstall CentOS and hope that the problem will be gone with a fresh install.


If you do that you may lose the trace of the problem if the problem vanishes after the reinstall. At least make a backup before overwriting your CentOS installation. I am really curious what the source of the delay is. One more question that comes to mind: are you running a stock kernel?
Post 14 Oct 2015, 22:22
Melissa
Joined: 12 Apr 2012
Posts: 71
I don't have Windows, but the Linux version on my i7 4790 shows:

[bmaxa@maxa-pc assembler]$ time ./slowlin

real 0m5.966s
user 0m5.953s
sys 0m0.000s

kernel is 4.3 rc5

Since max turbo is 4.0 GHz, it is approximately the same as your Windows time at 2.3 GHz.
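
The comparison can be made concrete by converting each run time into cycles per inner-loop iteration. This is a back-of-the-envelope sketch: the loop counts come from the posted source, and it assumes each CPU really ran at the clock speed quoted in the thread for the whole run (4.0 GHz turbo for the i7, a constant 2.3 GHz for the Xeon).

```python
OUTER = 30                   # outer loop count from the posted source
INNER = 673_223_862          # inner loop count from the posted source
ITERATIONS = OUTER * INNER   # total passes through the inner loop


def cycles_per_iteration(seconds: float, ghz: float) -> float:
    """Average clock cycles spent per inner-loop iteration."""
    return seconds * ghz * 1e9 / ITERATIONS


# (run time in seconds, assumed clock in GHz), all taken from the thread.
RUNS = {
    "Windows, Xeon at 2.3 GHz": (10.5, 2.3),
    "CentOS, Xeon at 2.3 GHz": (12.6, 2.3),
    "Linux, i7 4790 at 4.0 GHz": (5.966, 4.0),
}

for label, (seconds, ghz) in RUNS.items():
    print(f"{label}: {cycles_per_iteration(seconds, ghz):.2f} cycles/iteration")
```

Under those assumptions the Windows run and the i7 run both land near 1.2 cycles per iteration, while the slow CentOS run needs about 1.43.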
Post 15 Oct 2015, 03:19
coen
Joined: 13 Oct 2015
Posts: 14
ACP wrote:
coen wrote:

I don't know what to try anymore at this point, as a last resort I'm gonna reinstall CentOS and hope that the problem will be gone with a fresh install.


If you do that you may lose the trace of the problem if the problem vanishes after the reinstall. At least make a backup before overwriting your CentOS installation. I am really curious what the source of the delay is. One more question that comes to mind: are you running a stock kernel?


I've reinstalled CentOS and the performance difference is gone! Sadly I didn't see your post until it was too late, so I did not make a backup... I was running the stock kernel.

I want to thank all of you for trying to find the source of this weird performance issue. Unfortunately we'll never know what the real cause was :(, and for that I apologize.
Post 15 Oct 2015, 12:40

Copyright © 1999-2020, Tomasz Grysztar.
