flat assembler
Message board for the users of flat assembler.

Benchmarking 64-bit and SSE 128-bit?

FlierMate
Inspired by the "OpenGL - Inverse alpha?" topic, where Madis731 mentioned:

Madis731 wrote:

2) Have you checked MMX/SSE routines yet? There are excellent instructions to compare 2 to 4 pixels in one blow (8 bytes in MMX, 16 bytes in SSE).


So, out of curiosity, I made two Linux x64 programs to compare the speed of memory copying with 64-bit stores (general-purpose registers) and 128-bit stores (SSE, XMM registers).
(My CPU does not support AVX, so I would not be able to run an AVX test locally.)
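
On Linux, the extensions a CPU supports are listed on the `flags` line of `/proc/cpuinfo`. A minimal sketch of checking for them, assuming an x86 system with GNU grep; `has_cpu_flag` is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# has_cpu_flag FLAG [FILE]
# Succeeds if FLAG appears as a whole word on the first "flags" line of FILE
# (default: /proc/cpuinfo). The helper name is made up for this example.
has_cpu_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null \
        | tr ' ' '\n' \
        | grep -qxF "$1"
}

# Report a few extensions relevant to this thread.
for flag in sse sse2 avx; do
    if has_cpu_flag "$flag"; then
        echo "$flag: yes"
    else
        echo "$flag: no"
    fi
done
```

The whole-word match matters here: a plain substring search for `sse` would also match `sse2` or `ssse3`.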

And this is the result! The outputs are verified to be identical: 8192 KB (8,192,000 bytes) of data in each case.

Code:
boo@debian:~/fasm$ diff -s x64.txt sse.txt

Files x64.txt and sse.txt are identical
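
`diff -s` shows that the two outputs match each other, but not that either one is the expected pattern. A sketch of building an independent reference file to compare against, assuming the expected output is the 8-byte pattern `12345678` repeated to 8,192,000 bytes; `make_ref` and `ref.txt` are hypothetical names:

```shell
#!/bin/sh
# Build a reference file: "12345678" repeated until it is 8192000 bytes long.
# make_ref and ref.txt are names made up for this example.
make_ref() {
    yes 12345678 | tr -d '\n' | head -c 8192000 > "$1"
}

make_ref ref.txt

# Compare each program's output with the reference, if that output exists.
for f in x64.txt sse.txt; do
    if [ -f "$f" ]; then
        cmp "$f" ref.txt && echo "$f matches the reference"
    fi
done
```

`cmp` reports the first differing byte offset on a mismatch, which is usually more useful than `diff` for binary-sized blobs like these.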
    


This is how I ran the test on Linux:

Code:
boo@debian:~/fasm$ time ./x64 > x64.txt

real    0m0.060s
user    0m0.004s
sys     0m0.035s

boo@debian:~/fasm$ time ./sse > sse.txt

real    0m0.044s
user    0m0.001s
sys     0m0.041s
boo@debian:~/fasm$ 
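
In the timings above, `sys` accounts for most of `real` in both runs, so the measurement is probably dominated by kernel work (the `write` syscall and paging in the 8 MB buffer) rather than by the store loop itself, and a single run of a few tens of milliseconds is easily swamped by noise. A minimal sketch of repeating each run with the output discarded, assuming GNU `date` (for `%N`) and that `./x64` and `./sse` are in the current directory; `bench` is a hypothetical helper name:

```shell
#!/bin/sh
# bench RUNS COMMAND [ARGS...]
# Runs COMMAND the given number of times with stdout sent to /dev/null and
# prints the wall-clock time of each run in microseconds.
# The timing includes process start-up and the cost of calling date, so it is
# only a coarse comparison. The helper name is made up for this example.
bench() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)          # nanoseconds since the epoch (GNU date)
        "$@" > /dev/null
        end=$(date +%s%N)
        echo "run $((i + 1)): $(( (end - start) / 1000 )) us"
        i=$((i + 1))
    done
}

# Usage, assuming the two programs from this post have been built:
#   bench 10 ./x64
#   bench 10 ./sse
```

Comparing the fastest of several runs, rather than a single run, reduces the influence of whatever else the system was doing at the time.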
    


SSE.asm is faster than x64.asm!!!

This is not the first time I have run a benchmark, but it is definitely the first time I have used a CPU instruction set extension!

My code might not be optimal, so I would appreciate your advice on benchmarking correctly.

SSE.asm
Code:
; SSE Benchmark 1.0
;
; Requires an x86-64 CPU (SSE, introduced with the Pentium III, is always available there)

format ELF64 executable 3

segment readable executable

entry $

        lea     rbx,[_msg]
        lea     rdx,[_sse]
        movups  xmm0,[rbx]              ; load the 16-byte pattern from memory (unaligned load)
        mov     rcx,8192000             ; buffer size in bytes, a multiple of 16
.redo:
        movups  [rdx+rcx-16],xmm0       ; store 16 bytes (unaligned store), filling from the end of the buffer
        sub     rcx,16
        jne     .redo                   ; loop until rcx reaches zero

        mov     edx,8192000             ; number of bytes to write
        lea     rsi,[_sse]              ; buffer address
        mov     edi,1                   ; STDOUT
        mov     eax,1                   ; sys_write
        syscall

        xor     edi,edi                 ; exit code 0
        mov     eax,60                  ; sys_exit
        syscall

segment readable writeable

_sse rb 8192000
_msg db '1234567812345678'
    


x64.asm
Code:
format ELF64 executable 3

segment readable executable

entry $

        lea     rbx,[_msg]
        lea     rdx,[_x64]
        mov     r8,[rbx]                ; load the 8-byte pattern from memory into a 64-bit register
        mov     rcx,8192000             ; buffer size in bytes, a multiple of 8
.redo:
        mov     [rdx+rcx-8],r8          ; store 8 bytes, filling from the end of the buffer
        sub     rcx,8
        jne     .redo                   ; loop until rcx reaches zero

        mov     edx,8192000             ; number of bytes to write
        lea     rsi,[_x64]              ; buffer address
        mov     edi,1                   ; STDOUT
        mov     eax,1                   ; sys_write
        syscall

        xor     edi,edi                 ; exit code 0
        mov     eax,60                  ; sys_exit
        syscall

segment readable writeable

_x64 rb 8192000
_msg db '12345678'
    


Thank you for reading!
Post 30 Dec 2021, 17:04
revolution
Each system is different.

For each system you have, you can run the benchmarks to see the results for that system.

Also: sadly, benchmarks give only a very slight indication of how code will run in an entire program. So be wary of being too confident about what will happen when benchmarked code is included in a whole program; the outcome can be different.
Post 30 Dec 2021, 17:48
Copyright © 1999-2020, Tomasz Grysztar.