flat assembler
Message board for the users of flat assembler.

Index > Main > which is the fastest ?

ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 20 Sep 2010, 20:20

Which is the fastest?

edit:
not: mov qword[edi],mm1
but: movq qword[edi],mm1


Code:
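     xor eax,eax
     mov ecx,Y            ; Y = number of dwords to clear

A:   stosd                ; store eax at [es:edi], edi += 4
     loop A

;------------------------ or

     xor eax,eax
     movd mm1,eax         ; mm1 = 0
     mov ecx,Y/2          ; each movq clears two dwords

B:   movq qword[edi],mm1
     add edi,8
     loop B

     emms                 ; only needed if x87 code follows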
    

_________________
I am not young enough to know everything (Oscar Wilde)


Last edited by ouadji on 20 Sep 2010, 21:06; edited 1 time in total
Fanael



Joined: 03 Jul 2009
Posts: 168
Fanael 20 Sep 2010, 20:23
As usual, it depends on which CPU you are optimizing for (and on several other things as well).
edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 20 Sep 2010, 20:46
you can also do this:

Code:
xor eax,eax
mov ecx,Y/2
A:
stosd
stosd
loop A
    


or

Code:
xor eax,eax
mov ecx,Y
rep stosd
    


i believe the fastest is to do it with a single instruction, using the rep prefix.

i will compare them right now and give feedback
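
For reference, a minimal sketch of the single-instruction approach extended to an arbitrary byte count. The name clear_bytes and its register interface are assumptions made for illustration and do not come from this thread; the sketch assumes 32-bit code with ES covering the destination.

```asm
; Sketch only: clear_bytes is a hypothetical helper, not code from the thread.
; in:  edi = destination address, ecx = number of bytes to clear
; out: edi = first byte after the cleared block; eax, ecx, edx are modified
clear_bytes:
        cld                  ; make sure stos moves upward
        xor  eax,eax         ; value to store
        mov  edx,ecx         ; keep the full byte count
        shr  ecx,2           ; number of whole dwords
        rep  stosd           ; does nothing when ecx = 0
        mov  ecx,edx
        and  ecx,3           ; 0..3 leftover bytes
        rep  stosb
        ret
```

Unlike a loop-based version, rep with ecx = 0 stores nothing, so a zero-length request is harmless here.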
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 20 Sep 2010, 20:50
Quote:
i will compare it right now and give feedback
thank you edfed

edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 20 Sep 2010, 21:01
mov qword[edi],mm1 doesn't compile.
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 20 Sep 2010, 21:05

not "mov" ... but "movq"

edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 20 Sep 2010, 21:16
i have results.
the fastest on my PIII is test1 (~1180 clocks).
test3 is a little slower (~1260 clocks), and test2 is almost twice as slow (~2200 clocks).

Code:
X:
Y=10h
rd Y*2

@@: db 'test1',0
align 4
dd test1.size
dd @b
test1:
     xor eax,eax
     mov edi,X
     mov ecx,Y
@@:  stosd
     loop @b
.size=$-test1
     mov eax,0
     ret
;------------------------
@@: db 'test2',0
align 4
dd test2.size
dd @b
test2:
     xor eax,eax
     movd mm1,eax
     mov edi,X
     mov ecx,Y/2
@@:  movq [edi],mm1
     add edi,8
     loop @b
.size= $-test2
     mov eax,0
     ret
;------------------------

@@: db 'test3',0
align 4
dd test3.size
dd @b
Y=10h
test3:
     xor eax,eax
     mov edi,X
     mov ecx,Y
     rep stosd
.size=$-test3
     mov eax,0
     ret
;------------------------
    
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 20 Sep 2010, 21:43

surprising result (thank you edfed)
"loop" is faster than "rep" ...
and as for "movq/mm1", we can forget about it!

edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 20 Sep 2010, 21:51
there are many things to take into consideration.
the fact that we write through ES, and ES points to video memory in my test application.

the fact that the buffer is only 10 dwords.
and many other things.

but i am really sure when i say test1 is relatively the fastest.

[edit]
but i just tested with 100 dwords, and test3 becomes the fastest.

[edit1]
but i tested with the CPU overloaded, and MMX is faster.

in conclusion, a short table should use test1 (loop), a medium table should use test3 (rep), and a long table should use test2 (mmx).
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20303
Location: In your JS exploiting you and your system
revolution 20 Sep 2010, 23:14
Testing for speed on a single CPU with a single test and one test set over a short run is rather pointless.
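
One way to make such a measurement less fragile is to repeat each test many times and keep the smallest cycle count, so that interrupts and cold caches distort the result less. The following is a minimal sketch under stated assumptions: RUNS, time_min and best_time are hypothetical names that are not part of edfed's test application, the code is 32-bit, and the CPU has rdtsc (Pentium or later). It still measures only one machine, so it does not answer the point about testing on several CPUs.

```asm
; Sketch only: RUNS, time_min and best_time are hypothetical names.
RUNS = 1000

; in:  esi = address of the routine to measure (it must preserve esi and ebp)
; out: eax = smallest cycle count seen over RUNS runs (low 32 bits)
time_min:
        push ebx edi ebp
        mov  ebp,RUNS
        mov  dword [best_time],-1   ; start with the largest unsigned value
.next:
        xor  eax,eax
        cpuid                       ; serialize; overwrites eax, ebx, ecx, edx
        rdtsc                       ; edx:eax = time stamp counter
        push eax                    ; keep the low dword of the start time
        call esi                    ; run the code under test
        rdtsc
        pop  ecx
        sub  eax,ecx                ; elapsed cycles, valid while below 2^32
        cmp  eax,[best_time]
        jae  .keep
        mov  [best_time],eax        ; remember the fastest run so far
.keep:
        dec  ebp
        jnz  .next
        mov  eax,[best_time]
        pop  ebp edi ebx
        ret

best_time dd ?
```

It could be used as `mov esi,test3` followed by `call time_min`. The count includes the overhead of call, ret and rdtsc itself, so it is only meaningful for comparing routines with each other, not as an absolute figure.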
edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 20 Sep 2010, 23:46
but very interesting.
many things are revealed by capturing the timings over time and scrolling through the plotted results.
it shows many interactions with the real state of the machine.
:)

i stand corrected, test3 is the fastest, and the difference is huge.
test3 = 500 clocks, test1 = 2500 clocks, test2 = 2700 clocks

it really depends on the load of the machine. it is interesting to see that an algorithm can be influenced by the rest of the system.
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 21 Sep 2010, 07:42

These final results seem rather logical.
test3 is the only one in which the loop is performed inside the processor itself.
Thank you edfed for all the tests and results.

edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 23 Sep 2010, 10:34
here is a capture of the results (see the attached image).

it alternates between test1, test2, test3 and test4.
test3 = lowest plots, test1 = highest plots.

so, even if it doesn't give absolute results, it gives a good idea of the relative speed of several methods.

i am currently trying to simplify the process, because for the moment it only works as a compile-time application, where the tests have to be built in like this:
Code:
align 4
.list:  List \
        test1,\
        test2,\
        test3,\
        test4

Y=100h
X:
rd Y*2

@@: db 'test1',0
align 4
dd test1.size
dd @b
test1:
     xor eax,eax
     mov edi,X
     mov ecx,Y
@@:  stosd
     loop @b
.size=$-test1
     mov eax,0
     ret
;------------------------
@@: db 'test2',0
align 4
dd test2.size
dd @b
test2:
     xor eax,eax
     movd mm1,eax
     mov edi,X
     mov ecx,Y/2
@@:  movq [es:edi],mm1
     add edi,8
     loop @b
.size= $-test2
     mov eax,0
     ret
;------------------------

@@: db 'test3',0
align 4
dd test3.size
dd @b
test3:
     xor eax,eax
     mov edi,X
     mov ecx,Y
     rep stosd
.size=$-test3
     mov eax,0
     ret
;------------------------

@@: db 'test4',0
align 4
dd test4.size
dd @b
test4:
     xor eax,eax
     mov edi,X
     mov ecx,Y
@@:  mov [es:edi],eax
     add edi,4
     loop @b
.size=$-test4
     mov eax,0
     ret
;------------------------

    


maybe including a .bin at load time would be better.
assemble each snippet to a .bin, with org 0.
then just reference them in a list of paths, at compile time or at execution time on the command line, and then start the test?
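
A minimal sketch of the compile-time variant of this idea, using fasm's file directive to pull in snippets that were assembled separately. The file names, labels and table layout are hypothetical and only illustrate the principle.

```asm
; Sketch only: file names, labels and the table layout are hypothetical.
; Each snippet is assembled on its own first, for example with
; "fasm test1.asm test1.bin", from a source that starts with "org 0".

test1_code:
        file 'test1.bin'            ; raw bytes of the assembled snippet
test1_size = $ - test1_code

test2_code:
        file 'test2.bin'
test2_size = $ - test2_code

tests:  dd test1_code,test1_size    ; table that the test runner can walk
        dd test2_code,test2_size
        dd 0                        ; end marker
```

Because every snippet is assembled at org 0 and then placed at some other address, it has to be position independent. An absolute reference such as `mov edi,X` would no longer resolve, so the buffer address would have to be passed in a register instead.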


Attachment: tricktest.png (1.7 KB)


ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 23 Sep 2010, 13:49

great result edfed, good work.

bitRAKE



Joined: 21 Jul 2003
Posts: 4024
Location: vpcmpistri
bitRAKE 24 Sep 2010, 01:06
Code:
test8:
     pxor mm0,mm0
     mov edi,X
     mov ecx,Y/2
@@:  movntq [es:edi+ecx*8-8],mm0
     loop @b
     ret


test9:
     pxor xmm0,xmm0
     mov edi,X
     mov ecx,Y/4
@@:  movntdq [es:edi],xmm0
     add edi,16
     loop @b
     ret    
These might only be useful for larger blocks - haven't tested. It's also important if the memory will be used soon after clearing - if so, then non-temporal stores should be avoided. Of course, SSE2 is needed.
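
Two details matter when trying the non-temporal variants: movntdq faults if its memory operand is not 16-byte aligned, and non-temporal stores are weakly ordered, so an sfence is normally issued after the loop. As far as I know, movntq (test8) only needs the SSE integer extensions that a Pentium III already has, while pxor on xmm registers and movntdq (test9) need SSE2. A minimal sketch under those assumptions, where clear_nt and its register interface are hypothetical:

```asm
; Sketch only: clear_nt is a hypothetical name. Assumes SSE2 is available,
; edi is 16-byte aligned and ecx is a non-zero multiple of 16.
; in: edi = destination address, ecx = number of bytes to clear
clear_nt:
        pxor    xmm0,xmm0
@@:     movntdq [edi],xmm0      ; raises #GP if edi is not 16-byte aligned
        add     edi,16
        sub     ecx,16
        jnz     @b
        sfence                  ; order the non-temporal stores before later stores
        ret
```

In a fasm source the buffer can be aligned by placing `align 16` directly before its label.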
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 24 Sep 2010, 12:55

I'd be interested to know the test9 result
thank you edfed

edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 24 Sep 2010, 19:09
from what i see, i can say that test9 is not working properly, but as i am not sure my PIII has SSE2, i am not sure of the exact result.

you can test it yourself on your machine if you want.

i renamed test9, it is now test5.
i didn't include test8 because it looks like the mmx solution in test2


Last edited by edfed on 24 Sep 2010, 19:38; edited 1 time in total
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 24 Sep 2010, 19:22

Sorry for this reply in French, it is too complex for me in English.

Edfed, I tested on my PC. It is a quad-core Q6600 running XP Pro.
I can select the tests with the up and down arrow keys, but when I go past "test4" the program closes by itself. So I could not get to your "test5", too bad!

edfed



Joined: 20 Feb 2006
Posts: 4330
Location: Now
edfed 24 Sep 2010, 19:31
ok, it means that your CPU doesn't support sse2. lol. but i have my doubts. we should check whether it is the system that blocks the use of this instruction set in DOS mode.

i think you will need to test it with another kind of executable, for example a PE console program instead of a .com

hmm. i think the problem comes from the use of CPUID in test9.

i will correct it and post a new version.
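
For comparison, a minimal sketch of an SSE2 check through CPUID. The name has_sse2 is hypothetical and this is not the code from test9; it assumes 32-bit code and a CPU that has the CPUID instruction. Note that CPU support alone is not sufficient: the operating system also has to enable SSE state (the OSFXSR bit in CR4), otherwise SSE2 instructions fault even on a capable CPU (a Core 2 Quad such as the Q6600 does implement SSE2).

```asm
; Sketch only: has_sse2 is a hypothetical name, not the code from test9.
; out: eax = 1 if the CPU reports SSE2, 0 otherwise; ecx and edx are modified
has_sse2:
        push ebx                ; cpuid overwrites eax, ebx, ecx and edx
        mov  eax,1
        cpuid                   ; leaf 1: feature flags in edx and ecx
        xor  eax,eax
        bt   edx,26             ; edx bit 26 = SSE2 (bit 25 = SSE)
        setc al
        pop  ebx
        ret
```

A test that needs SSE2 could call this first and report "not supported" instead of executing an instruction that would fault.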

but i suggest you assemble my source directly after adding your own tests.


Attachment: B2h.zip (16.25 KB)



Last edited by edfed on 28 Feb 2011, 14:06; edited 1 time in total
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji 24 Sep 2010, 20:18

The Q6600 does not support sse2?

avx, fine, but sse2?
I do hope it does ...!
(I have a hammer right next to the PC) :'(

ps: no source, only a "file.com"


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
