flat assembler
Message board for the users of flat assembler.
which is the fastest ?
Fanael 20 Sep 2010, 20:23
As usual, it depends on which CPU you are optimizing for (and on several other things as well).
edfed 20 Sep 2010, 20:46
You can also do this:
Code:
        xor eax,eax             ; value to store
        mov ecx,Y/2             ; two dwords per iteration
A:
        stosd
        stosd
        loop A
or
Code:
        xor eax,eax
        mov ecx,Y
        rep stosd
I believe the fastest is to do it with a single instruction, using the rep prefix. I will compare them right now and give feedback.
ouadji 20 Sep 2010, 20:50
Quote: i will compare it right now and give feedback
edfed 20 Sep 2010, 21:01
mov qword[edi],mm1 doesn't compile.
ouadji 20 Sep 2010, 21:05
Not "mov" ... but "movq".
edfed 20 Sep 2010, 21:16
I have results.
The fastest on my PIII is test1 (~1180 clocks). test3 is a little slower (~1260 clocks), and test2 is about twice as slow (~2200 clocks).
Code:
X:
Y=10h
        rd Y*2                  ; buffer that the tests fill with zeros

@@:     db 'test1',0            ; name shown by the test program
        align 4
        dd test1.size
        dd @b
test1:                          ; stosd + loop
        xor eax,eax
        mov edi,X
        mov ecx,Y
@@:
        stosd
        loop @b
.size=$-test1
        mov eax,0
        ret
;------------------------
@@:     db 'test2',0
        align 4
        dd test2.size
        dd @b
test2:                          ; MMX, 8 bytes per iteration
        xor eax,eax
        movd mm1,eax
        mov edi,X
        mov ecx,Y/2
@@:
        movq [edi],mm1
        add edi,8
        loop @b
.size=$-test2
        mov eax,0
        ret
;------------------------
@@:     db 'test3',0
        align 4
        dd test3.size
        dd @b
Y=10h                           ; same value as above
test3:                          ; rep stosd
        xor eax,eax
        mov edi,X
        mov ecx,Y
        rep stosd
.size=$-test3                   ; the post had $-test1 here
        mov eax,0
        ret
;------------------------
ouadji 20 Sep 2010, 21:43
Surprising result (thank you, edfed): "loop" is faster than "rep" ... and as for "movq/mm1", we can forget that!
edfed 20 Sep 2010, 21:51
There are many things to take into consideration: the fact that we write to ES, and ES is the video memory in my test application; the fact that the buffer is only 10 dwords; and many other things. But I am quite sure when I say test1 is relatively the fastest.
[edit] But I just tested with 100 dwords, and test3 becomes the fastest.
[edit1] But I tested with the CPU overloaded, and MMX is faster.
In conclusion, a short table should use test1 (loop), a medium table test3 (rep), and a long table test2 (MMX).
revolution 20 Sep 2010, 23:14
Testing for speed on a single CPU with a single test and one test set over a short run is rather pointless.
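To make the objection concrete, here is a minimal sketch of a harness that repeats a test many times and keeps the lowest count, so that a single disturbed run does not decide the result. It is not taken from edfed's program: the names time_min and RUNS, the repeat count, and the calling convention are all assumptions, and it assumes 32-bit flat code (use32) on a CPU with cpuid and rdtsc.

```asm
; Hypothetical helper, not part of the posted test program.
; in:       esi = address of the routine to time
;           (the routine must preserve esi and ebp and leave the stack balanced)
; out:      eax = lowest clock count seen over RUNS calls (low dword of the TSC delta)
; clobbers: ecx, edx
RUNS = 1000                     ; assumed repeat count

time_min:
        push ebx                ; cpuid overwrites ebx
        push edi                ; the posted tests overwrite edi
        push ebp
        mov  ebp,RUNS
        push -1                 ; [esp] = best so far, starts at 0FFFFFFFFh
.again:
        xor  eax,eax
        cpuid                   ; serialize before reading the counter
        rdtsc
        push eax                ; start time, low dword
        call esi                ; run the routine under test
        rdtsc
        pop  edx                ; edx = start time
        sub  eax,edx            ; eax = elapsed clocks, including call/ret overhead
        cmp  eax,[esp]
        jae  .next
        mov  [esp],eax          ; new minimum
.next:
        dec  ebp
        jnz  .again
        pop  eax                ; best count
        pop  ebp
        pop  edi
        pop  ebx
        ret
```

With the tests from this thread it would be used as mov esi,test3 followed by call time_min. The minimum says nothing about behaviour under load or on other CPUs, so the caveat about a single CPU and a single data set still applies.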
edfed 20 Sep 2010, 23:46
But very interesting.
Many things are revealed by capturing the timings and scrolling through the plotted results; it shows many interactions with the real state of the machine.
I stand corrected: test3 is the fastest, and the difference is huge: test3 = 500 clocks, test1 = 2500 clocks, test2 = 2700 clocks.
It really depends on the load of the machine. It is interesting to see that an algorithm can be influenced by the rest of the system.
ouadji 21 Sep 2010, 07:42
These final results seem rather logical: test3 is the only one in which the loop is performed inside the processor. Thank you, edfed, for all the tests and results.
edfed 23 Sep 2010, 10:34
Then, a capture of the results.
It alternates test1, test2, test3 and test4; test3 = lower plots, test1 = higher plots. So even if it doesn't give absolute results, it gives a good idea of the relative speed of several methods. I am currently trying to simplify the process, because for the moment it is only a compile-time application that needs this kind of implementation:
Code:
        align 4
.list:  List \
        test1,\
        test2,\
        test3,\
        test4

Y=100h
X:      rd Y*2                  ; buffer that the tests fill with zeros

@@:     db 'test1',0            ; name shown by the test program
        align 4
        dd test1.size
        dd @b
test1:                          ; stosd + loop
        xor eax,eax
        mov edi,X
        mov ecx,Y
@@:
        stosd
        loop @b
.size=$-test1
        mov eax,0
        ret
;------------------------
@@:     db 'test2',0
        align 4
        dd test2.size
        dd @b
test2:                          ; MMX, 8 bytes per iteration
        xor eax,eax
        movd mm1,eax
        mov edi,X
        mov ecx,Y/2
@@:
        movq [es:edi],mm1
        add edi,8
        loop @b
.size=$-test2
        mov eax,0
        ret
;------------------------
@@:     db 'test3',0
        align 4
        dd test3.size
        dd @b
test3:                          ; rep stosd
        xor eax,eax
        mov edi,X
        mov ecx,Y
        rep stosd
.size=$-test3
        mov eax,0
        ret
;------------------------
@@:     db 'test4',0
        align 4
        dd test4.size
        dd @b
test4:                          ; plain mov + add + loop
        xor eax,eax
        mov edi,X
        mov ecx,Y
@@:
        mov [es:edi],eax
        add edi,4
        loop @b
.size=$-test4
        mov eax,0
        ret
;------------------------
Maybe a .bin include at load time is better: compile each snippet to a .bin with org 0, then just reference them in a list of paths, at compile time, or at execution time with a command line, and then begin the test?
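One possible shape for the .bin idea, as a sketch only. The file names and the labels test_table, test3_bin and test3_end are hypothetical, and it assumes the harness passes the buffer address in edi and the dword count in ecx, whereas the posted tests load X and Y themselves. The only fasm feature relied on is the file directive, which places the raw bytes of a file into the output.

```asm
; ---- test3.asm (hypothetical file), assembled on its own: fasm test3.asm test3.bin
; Buffer in edi and count in ecx come from the caller, so the snippet
; holds no absolute address and can be placed anywhere in memory.
        org 0
        xor eax,eax
        rep stosd
        ret

; ---- in the test program (hypothetical labels)
test_table:
        dd test3_bin, test3_end-test3_bin   ; address, size in bytes
        dd 0                                ; end of list

test3_bin:
        file 'test3.bin'                    ; raw bytes of the assembled snippet
test3_end:
```

This covers only the compile-time variant. Loading the snippets from paths at execution time would need file I/O that depends on the host environment, which the thread does not specify.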
ouadji 23 Sep 2010, 13:49
Great result, edfed, good work.
bitRAKE 24 Sep 2010, 01:06
Code:
test8:                          ; movntq: non-temporal 8-byte store (SSE)
        pxor mm0,mm0
        mov edi,X
        mov ecx,Y/2
@@:
        movntq [es:edi+ecx*8-8],mm0
        loop @b
        ret

test9:                          ; movntdq: non-temporal 16-byte store (SSE2)
        pxor xmm0,xmm0
        mov edi,X
        mov ecx,Y/4
@@:
        movntdq [es:edi],xmm0
        add edi,16
        loop @b
        ret
ouadji 24 Sep 2010, 12:55
I'd be interested to know the test9 result. Thank you, edfed.
edfed 24 Sep 2010, 19:09
From what I see, I can say that test9 is not working properly, but as I am not sure I have SSE2 on my PIII, I am not sure of the exact result.
You can test it yourself on your machine if you want. I renamed test9; it is now test5. I didn't include test8 because it looks like the MMX solution in test2.
Last edited by edfed on 24 Sep 2010, 19:38; edited 1 time in total
ouadji 24 Sep 2010, 19:22
Sorry for this reply in French; it is too complex for me in English. edfed, I tested on my PC. It is a quad-core Q6600 running XP Pro. I can select the tests with the up and down arrows, but when I go past "test4" the program closes by itself. So I could not get to your "test5", a pity!
edfed 24 Sep 2010, 19:31
OK, it means that you don't support SSE2, lol. But I have my doubts. We should check whether it is the system that blocks the use of this instruction set in DOS mode.
I think you will need to test it with another type of code, for example in a PE console program instead of a .com. Hmm, I think the problem comes from the use of CPUID in test9. I will correct it and post a new version. But I suggest you compile my source directly after adding your own tests.
Last edited by edfed on 28 Feb 2011, 14:06; edited 1 time in total
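For reference, whether a CPU reports SSE2 can be read from CPUID leaf 1: bit 26 of edx is SSE2 and bit 25 is SSE. The routine below is a minimal sketch, not the code used in test9 (which was not posted). The name has_sse2 is hypothetical, and it assumes the cpuid instruction itself is available, which holds for the PIII and Q6600 mentioned here.

```asm
; Hypothetical helper, not the check used in the posted program.
; out:      CF = 1 if the CPU reports SSE2, CF = 0 otherwise
; clobbers: eax, ecx, edx
has_sse2:
        push ebx                ; cpuid overwrites ebx
        mov  eax,1
        cpuid                   ; leaf 1: feature flags in edx and ecx
        pop  ebx
        bt   edx,26             ; bit 26 = SSE2 (bit 25 = SSE)
        ret
```

A caller would follow call has_sse2 with jnc to whatever code skips the movntdq test. The Pentium III reports SSE but not SSE2, while the Q6600 reports both. The flag only says what the CPU offers: instructions that use xmm registers also need the environment to have enabled SSE state (CR4.OSFXSR) and raise an invalid-opcode exception otherwise, which this sketch does not check.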
ouadji 24 Sep 2010, 20:18
The Q6600 does not support SSE2? AVX, fine, but SSE2? I do hope that it does ... ! (I have a hammer right next to the PC.) PS: no source, only a "file.com".
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.