flat assembler
Message board for the users of flat assembler.
jackblack 08 Sep 2012, 10:10
I'm sorry, I missed the second page of this topic. Now that I have read your message, I have made two binary files, one with aligned and one with unaligned code.
I uploaded them to my Google Drive. Here are the links:
https://docs.google.com/open?id=0B_IqYbGFxbzeNm9oNkVncFYwdm8
https://docs.google.com/open?id=0B_IqYbGFxbzeWHpRVjhBb2Zkb3c
If you can test them on many CPUs, that would be great. On my PC I got the following results:

Unaligned code
Code:
################################################################################
# Array library speed test #
################################################################################
This test operates on 10000000 elements wide flt64_t arrays in 100 rounds.
Addition:
Classic scalar code time: 8.469804 sec
LinAsm vector code time: 8.507637 sec
Subtraction:
Classic scalar code time: 8.525537 sec
LinAsm vector code time: 8.505540 sec
Multiplication:
Classic scalar code time: 8.524257 sec
LinAsm vector code time: 8.476669 sec
Division:
Classic scalar code time: 22.848977 sec
LinAsm vector code time: 17.084971 sec

Aligned code
Code:
################################################################################
# Array library speed test #
################################################################################
This test operates on 10000000 elements wide flt64_t arrays in 100 rounds.
Addition:
Classic scalar code time: 8.492586 sec
LinAsm vector code time: 8.521262 sec
Subtraction:
Classic scalar code time: 8.544034 sec
LinAsm vector code time: 8.527114 sec
Multiplication:
Classic scalar code time: 8.546466 sec
LinAsm vector code time: 8.501122 sec
Division:
Classic scalar code time: 22.880396 sec
LinAsm vector code time: 17.094742 sec

In these tests the aligned code looks slower, but it is not a real slowdown. It is because my PC was doing other tasks at the same time.
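To put a number on how small the gap is, here is a minimal Python sketch that computes the relative difference between the aligned and unaligned vector-code timings quoted in this post. The dictionary keys and the helper name `relative_diff_percent` are made up for illustration; only the timings come from the post.

```python
# LinAsm vector-code timings in seconds, copied from the results in this post.
unaligned = {"add": 8.507637, "sub": 8.505540, "mul": 8.476669, "div": 17.084971}
aligned = {"add": 8.521262, "sub": 8.527114, "mul": 8.501122, "div": 17.094742}


def relative_diff_percent(a, b):
    """Return how much a exceeds b, as a percentage of b (hypothetical helper)."""
    return (a - b) / b * 100.0


# Positive values mean the aligned build took longer than the unaligned one.
diffs = {op: relative_diff_percent(aligned[op], unaligned[op]) for op in aligned}

for op, d in diffs.items():
    print(f"{op}: aligned vs unaligned {d:+.2f}%")
```

Every difference comes out well under half a percent, which is consistent with background load on the machine rather than a real effect of alignment.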
LocoDelAssembly 08 Sep 2012, 22:45
Code:
loco@Ubuntu-VAIO:~/Desktop$ ./ArrayTest_aligned && ./ArrayTest_unaligned && ./ArrayTest_aligned && ./ArrayTest_unaligned
################################################################################
# Array library speed test #
################################################################################
This test operates on 10000000 elements wide flt64_t arrays in 100 rounds.
Addition:
Classic scalar code time: 2.037717 sec
LinAsm vector code time: 1.898958 sec
Subtraction:
Classic scalar code time: 2.023843 sec
LinAsm vector code time: 1.887334 sec
Multiplication:
Classic scalar code time: 2.040209 sec
LinAsm vector code time: 1.892707 sec
Division:
Classic scalar code time: 11.264690 sec
LinAsm vector code time: 5.922402 sec
################################################################################
# Array library speed test #
################################################################################
This test operates on 10000000 elements wide flt64_t arrays in 100 rounds.
Addition:
Classic scalar code time: 2.058576 sec
LinAsm vector code time: 1.871824 sec
Subtraction:
Classic scalar code time: 2.024328 sec
LinAsm vector code time: 1.916844 sec
Multiplication:
Classic scalar code time: 2.023449 sec
LinAsm vector code time: 1.889113 sec
Division:
Classic scalar code time: 11.173319 sec
LinAsm vector code time: 5.934704 sec
################################################################################
# Array library speed test #
################################################################################
This test operates on 10000000 elements wide flt64_t arrays in 100 rounds.
Addition:
Classic scalar code time: 2.016459 sec
LinAsm vector code time: 1.892393 sec
Subtraction:
Classic scalar code time: 2.026710 sec
LinAsm vector code time: 2.007047 sec
Multiplication:
Classic scalar code time: 2.222073 sec
LinAsm vector code time: 2.026494 sec
Division:
Classic scalar code time: 11.787808 sec
LinAsm vector code time: 6.126286 sec
################################################################################
# Array library speed test #
################################################################################
This test operates on 10000000 elements wide flt64_t arrays in 100 rounds.
Addition:
Classic scalar code time: 2.278812 sec
LinAsm vector code time: 2.034858 sec
Subtraction:
Classic scalar code time: 2.272292 sec
LinAsm vector code time: 2.026706 sec
Multiplication:
Classic scalar code time: 2.106645 sec
LinAsm vector code time: 2.063666 sec
Division:
Classic scalar code time: 11.977029 sec
LinAsm vector code time: 6.221352 sec
loco@Ubuntu-VAIO:~/Desktop$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.1 LTS
Release: 12.04
Codename: precise
loco@Ubuntu-VAIO:~/Desktop$ uname -a
Linux Ubuntu-VAIO 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

This was run under VirtualBox; the host is Windows 7 64-bit. The computer is this one: Sony VAIO - VPCEJ16FX/B

Short specs:
Code:
Processor
Processor Type : Intel® Core™ i3-2310M
Processor Technology : Dual Core
Processor Speed : 2.10GHz
Processor Cache : 3MB
Memory
Installed Memory : 4GB (2GB x 2)
Memory Type/Speed : DDR3/1333MHz
Max. Memory : 8GB

I can't test on the Athlon64 right now; I'll come back to you with the results later.
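For readers who want the scalar-versus-vector comparison as a ratio, the following Python sketch divides the scalar time by the vector time for the first ./ArrayTest_aligned run listed above. The variable names are illustrative only; the numbers are copied from that run and nothing else is assumed.

```python
# Timings in seconds from the first ./ArrayTest_aligned run quoted above.
scalar = {"add": 2.037717, "sub": 2.023843, "mul": 2.040209, "div": 11.264690}
vector = {"add": 1.898958, "sub": 1.887334, "mul": 1.892707, "div": 5.922402}

# A speedup above 1 means the vector code finished faster than the scalar code.
speedup = {op: scalar[op] / vector[op] for op in scalar}

for op, s in speedup.items():
    print(f"{op}: {s:.2f}x")
```

In this run addition, subtraction and multiplication gain roughly 7%, while division is close to 1.9 times faster. The same calculation can be repeated for the other three runs by swapping in their timings.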
PS: If anyone trying to run the binaries gets the message "No such file or directory" even after making sure the execute permissions are set correctly (with "chmod +x ArrayTest_*"), check that you have either a file or a symlink at /lib/ld-linux-x86-64.so.2; if not, add a symlink to your interpreter. In my case I solved the problem with "sudo ln -s /lib/x86_64-linux-gnu/ld-2.15.so ld-linux-x86-64.so.2"
PS2: I realized the other day that my CPU re-introduced the trace cache, although a smaller one than the Pentium 4 has, so if misalignment isn't hurting performance, perhaps it is due to this "L0" cache.
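The "No such file or directory" message in this situation typically refers to the program interpreter named inside the binary, not to the binary itself. As a rough illustration, here is a Python sketch that reads that interpreter path from the PT_INTERP program header. It assumes a little-endian 64-bit ELF file, and the function name `elf64_interpreter` is invented for this example.

```python
import struct

PT_INTERP = 3  # program header type that holds the interpreter path


def elf64_interpreter(data: bytes):
    """Return the PT_INTERP path of a little-endian ELF64 image, or None.

    Hypothetical helper: handles only ELFCLASS64 / little-endian files.
    """
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    # ELF64 header: e_phoff at offset 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        # ELF64 program header starts with p_type, p_flags, p_offset.
        p_type, _p_flags, p_offset = struct.unpack_from("<IIQ", data, base)
        if p_type == PT_INTERP:
            # p_filesz sits at offset 32 within the program header.
            (p_filesz,) = struct.unpack_from("<Q", data, base + 32)
            raw = data[p_offset:p_offset + p_filesz]
            return raw.rstrip(b"\0").decode()
    return None
```

To use it, read one of the test binaries with `open(path, "rb").read()`, pass the bytes to the function, and check the returned path with `os.path.exists`. If that path is missing, a symlink like the one described in the PS is the kind of fix that applies.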
jackblack 10 Sep 2012, 05:56
I think I have found the answer to why there is no difference between aligned and unaligned code on recent CPUs. The Pentium 4 (NetBurst) and almost all processors after it have a cache of decoded instructions. Assembly instructions are decoded into internal RISC commands only once, and then they are stored in that instruction cache. So when you run small loops and function calls, the decoded instructions are read from the CPU cache rather than from memory. That is why code alignment no longer matters nowadays.
This is just my own interpretation of the results. If anyone has information about why the aligned and unaligned code perform the same, please let me know.
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.