flat assembler
Message board for the users of flat assembler.
which is the fastest ? (page 2)
ouadji 24 Sep 2010, 21:21
The Q6600 (Kentsfield) supports the instruction sets up to SSE3 and SSSE3. I checked on real hardware with: PMADDUBSW xmm5,dqword[ebx] and there was no problem, it runs. So yes, XP being what "blocks" it ... possible.
ouadji 24 Sep 2010, 22:07
I tested it.
A loop of 32 moves, but ... itself inside a loop of 100,000,000 iterations, so that the two versions can be compared with a stopwatch. Of course: ecx with "stosd" and ecx/4 with "movntdq".
Code:
; stosd
; -----
        mov     ecx,100000000
loop_test:
        push    ecx
        xor     eax,eax
        mov     ecx,Dim_RAZ_dword
        mov     edi,address_global_library
        rep stosd
        pop     ecx
        loop    loop_test
;------------------------------------
; movntdq
; -------
        mov     ecx,100000000
loop_test:
        push    ecx
        pxor    xmm0,xmm0
        mov     edi,address_global_library
        mov     ecx,Dim_RAZ_dword/4
@@:
        movntdq [edi],xmm0
        add     edi,16
        loop    @B
        pop     ecx
        loop    loop_test

There is absolutely no contest: "stosd" is much faster (even with "ecx in full"). The code with "stosd" is 4 to 5 times faster than the code with "movntdq". All these instructions that use the mmx and xmm registers are super powerful, but they eat a hell of a lot of clock cycles!
edfed 24 Sep 2010, 22:38
I think you are on the right track.
In any case, the best x86 code is the code that mainly uses the general-purpose registers. x86 machines are optimized for executing the general instructions (jcc, div, imul, ...), and more particularly the RISC-like instructions (add, neg, not, and, ...), which is logical. As for the SSE registers and the like, the only place where they perform well is among themselves, when you execute only computation instructions involving two registers. If memory is involved, it's dead, I have the impression. In 64-bit, on the other hand, I have no idea.
Alphonso 29 Sep 2010, 15:27
FWIW for me on Core2 test 2 (B) is fastest and dec ecx, jnz is quicker than loop.
Code:
---------------------------   ---------------------------
Test A with LOOP              Test B with LOOP
---------------------------   ---------------------------
Run 1 - 1152609               Run 1 - 688117
Run 2 - 566806                Run 2 - 283381
Run 3 - 566703                Run 3 - 283364
Run 4 - 628949                Run 4 - 283365
Run 5 - 566755                Run 5 - 283365
Run 6 - 566695                Run 6 - 283364
Run 7 - 566695                Run 7 - 283364
Run 8 - 566695                Run 8 - 284640
Run 9 - 589662                Run 9 - 283365
Run 10- 566746                Run 10- 283364

---------------------------   ---------------------------
Test A with DEC               Test B with DEC
---------------------------   ---------------------------
Run 1 - 682456                Run 1 - 616020
Run 2 - 272765                Run 2 - 137827
Run 3 - 272714                Run 3 - 137751
Run 4 - 272714                Run 4 - 137776
Run 5 - 272714                Run 5 - 137751
Run 6 - 272714                Run 6 - 137751
Run 7 - 272722                Run 7 - 137751
Run 8 - 272774                Run 8 - 137751
Run 9 - 274890                Run 9 - 137751
Run 10- 533222                Run 10- 137751

Run under Vista
Code:
format PE GUI 4.0
entry start
include 'win32a.inc'

Y = 100000
RunTest = 1
UseLoop = 1

;===========================================================
section '.text' code readable executable
;-----------------------------------------------------------
start:
        invoke  SetProcessAffinityMask,-1,1
        invoke  Sleep,10
        xor     ebx,ebx
Again:
        cld
        mov     edi,Buff
        rdtsc
        mov     [StartEAX+ebx*4],eax
        mov     [StartEDX+ebx*4],edx
;------------------------
if RunTest = 1
        xor     eax,eax
        mov     ecx,Y
        align 8
A:
        stosd
  if UseLoop = 1
        loop    A
  else
        dec     ecx
        jnz     A
  end if
else
;------------------------
        xor     eax,eax
        movd    mm1,eax
        mov     ecx,Y/2
        align 8
B:
        movq    [edi],mm1
        add     edi,8
  if UseLoop = 1
        loop    B
  else
        dec     ecx
        jnz     B
  end if
end if
;------------------------
        rdtsc
        mov     [EndEAX+ebx*4],eax
        mov     [EndEDX+ebx*4],edx
        inc     ebx
        cmp     ebx,10
        jb      Again

        xor     ebx,ebx
Next:
        mov     eax,[StartEAX+ebx*4]
        mov     edx,[StartEDX+ebx*4]
        sub     [EndEAX+ebx*4],eax
        sbb     [EndEDX+ebx*4],edx
        inc     ebx
        cmp     ebx,10
        jb      Next

        cinvoke wsprintf,Buff,wsformat,\
                [EndEAX],[EndEDX],\
                [EndEAX+4],[EndEDX+4],\
                [EndEAX+8],[EndEDX+8],\
                [EndEAX+0ch],[EndEDX+0ch],\
                [EndEAX+10h],[EndEDX+10h],\
                [EndEAX+14h],[EndEDX+14h],\
                [EndEAX+18h],[EndEDX+18h],\
                [EndEAX+1ch],[EndEDX+1ch],\
                [EndEAX+20h],[EndEDX+20h],\
                [EndEAX+24h],[EndEDX+24h]
        invoke  MessageBox,0,Buff,Tit,0
exit:
        invoke  ExitProcess,0

;===========================================================
section '.data' data readable writeable shareable
;-----------------------------------------------------------
StartEAX        rd 10
StartEDX        rd 10
EndEAX          rd 10
EndEDX          rd 10

if RunTest = 1
  if UseLoop = 1
    Tit db 'Test A with LOOP',0
  else
    Tit db 'Test A with DEC',0
  end if
else
  if UseLoop = 1
    Tit db 'Test B with LOOP',0
  else
    Tit db 'Test B with DEC',0
  end if
end if

wsformathex     db '%08X',0
wsformat        db 'Run 1 - %I64u',13,10
                db 'Run 2 - %I64u',13,10
                db 'Run 3 - %I64u',13,10
                db 'Run 4 - %I64u',13,10
                db 'Run 5 - %I64u',13,10
                db 'Run 6 - %I64u',13,10
                db 'Run 7 - %I64u',13,10
                db 'Run 8 - %I64u',13,10
                db 'Run 9 - %I64u',13,10
                db 'Run 10- %I64u',13,10
                db 0
wsformat2       db '%s',0

Buff            rd Y

;===========================================================
section '.idata' import data readable writeable
;-----------------------------------------------------------
library kernel32,'KERNEL32.DLL',\
        user32,'USER32.DLL'

include 'api/kernel32.inc'
include 'api/user32.inc'
;===========================================================
ouadji 29 Sep 2010, 15:51
Thank you, Alphonso. And with "rep stosd"?
Alphonso 30 Sep 2010, 05:00
Code:
---------------------------
With rep stosd
---------------------------
Run 1 - 658470
Run 2 - 74239
Run 3 - 70856
Run 4 - 70873
Run 5 - 70873
Run 6 - 70882
Run 7 - 70881
Run 8 - 70864
Run 9 - 70873
Run 10- 70882
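To compare these runs with the earlier LOOP/DEC figures, the raw counts can be normalized to TSC ticks per dword stored. The following is only a minimal Python sketch, not part of the original tests: it assumes every test clears Y = 100000 dwords per run (as in the program posted earlier; the value used for the rep stosd run is not stated), it takes the smallest of the ten runs as the steady-state figure, and the names `results` and `ticks_per_dword` are made up for illustration.

```python
# TSC counts copied from the runs posted in this thread (Core2, Vista).
# Assumption: every test clears Y = 100000 dwords per run.
Y = 100000

results = {
    "stosd + loop": [1152609, 566806, 566703, 628949, 566755,
                     566695, 566695, 566695, 589662, 566746],
    "stosd + dec/jnz": [682456, 272765, 272714, 272714, 272714,
                        272714, 272722, 272774, 274890, 533222],
    "movq + loop": [688117, 283381, 283364, 283365, 283365,
                    283364, 283364, 284640, 283365, 283364],
    "movq + dec/jnz": [616020, 137827, 137751, 137776, 137751,
                       137751, 137751, 137751, 137751, 137751],
    "rep stosd": [658470, 74239, 70856, 70873, 70873,
                  70882, 70881, 70864, 70873, 70882],
}


def ticks_per_dword(runs, dwords=Y):
    """Best-case TSC ticks per dword.

    Taking the minimum filters out the slow first run (warm-up)
    and runs that were interrupted by the OS.
    """
    return min(runs) / dwords


for name, runs in results.items():
    print(f"{name:16s} {ticks_per_dword(runs):.2f} ticks/dword")
```

Under those assumptions the sketch gives roughly 5.67, 2.73, 2.83, 1.38 and 0.71 ticks per dword, so on this machine rep stosd comes out about 1.9 times faster than the best movq loop.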
bitRAKE 30 Sep 2010, 06:24
Code:
Y = 1 shl 26

else if RunTest = 3
;------------------------
        pxor    mm0,mm0
        mov     ecx,Y/2
@@:
        movntq  [edi+ecx*8-8],mm0
        loop    @B
end if

;-------------------------------------
align 64
Buff rd Y

(LOOP/DEC do not matter because code is waiting on bus.)
(Also, note REP STOSB not optimized within CPU.)
Alphonso 30 Sep 2010, 12:24
bitRAKE wrote:
The above code is still faster with dec for me, by ~30%. XMM works well though, and in that case loop does not seem to affect the result with Y = 1 shl 26. What about the CPU store buffer? I wonder how much of an effect that might have. And on a system with hyperthreading, would using the quicker dec instead of loop leave more time on the execution units for the other thread?
bitshifter 30 Sep 2010, 13:06
rdtsc is OK for comparing code on one single PC, but it will not produce the same results on other types of machine. Also, it should be serialized first (e.g. with cpuid, eax=0). See Agner Fog's TSC testing utility. You really should get Agner's PMC tester, which measures micro-ops instead of clocks. It must be calibrated to the specific chipset. I don't know of a FASM port yet; maybe that is something we can look forward to doing (as a community). I know, nothing gets done as a community project around here, but maybe we can forget our differences and build a real tool for everyone...
bitRAKE 01 Oct 2010, 01:59
My own opinion is that clearing a large section of memory is the wrong approach to the problem, as it can be initialized at page granularity during first use. For example, I often use free lists for blocks of data because de-/allocation is trivial. It's better to pay the overhead of an exception handler on guard pages to grow memory commit size -- drawing the page into cache during initialization. There is a little more cache thrashing (when page granularity does not align with usage pattern), but no massive pause at start of application.
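The idea of paying the initialization cost on first use, combined with a free list, can be modeled outside assembly. The following Python sketch is only an illustration under stated assumptions: `BlockPool`, `alloc` and `free` are hypothetical names, a dict stands in for committed pages, and the guard-page exception handler itself is not modeled.

```python
class BlockPool:
    """Fixed-size block pool with a free list and lazy zeroing.

    Hypothetical model: a block is zeroed the first time it is handed
    out, instead of clearing the whole pool at start-up.
    """

    def __init__(self, block_size, block_count):
        self.block_size = block_size
        self.block_count = block_count
        self.blocks = {}      # stand-in for committed memory, filled on demand
        self.free_list = []   # indices of released blocks (LIFO stack)
        self.next_fresh = 0   # lowest index that was never handed out

    def alloc(self):
        # Reuse a released block if there is one: a single pop.
        if self.free_list:
            return self.free_list.pop()
        if self.next_fresh == self.block_count:
            raise MemoryError("pool exhausted")
        index = self.next_fresh
        self.next_fresh += 1
        # First use: "commit" and zero just this one block.
        self.blocks[index] = bytearray(self.block_size)
        return index

    def free(self, index):
        # Releasing is a single push; the block keeps its old contents.
        self.free_list.append(index)


pool = BlockPool(block_size=64, block_count=1024)
a = pool.alloc()
b = pool.alloc()
pool.free(a)
c = pool.alloc()                 # reuses the block released above
print(c == a, len(pool.blocks))  # only two blocks were ever initialized
```

In this model only the blocks that are actually used get initialized, which mirrors the trade-off described above: a small cost at each first use instead of one long pause at start-up.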
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.