flat assembler
Message board for the users of flat assembler.

which is the fastest ?

edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
It's true that it's surprising. But with XP, nothing surprises me.
XP blocks practically everything related to DOS/BIOS. It wouldn't surprise me if it blocked certain instruction sets for applications running in V86 mode.
Post 24 Sep 2010, 20:24
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji

The Q6600 (Kentsfield) supports instruction sets up to SSE3 and SSSE3.
I checked "for real" with: PMADDUBSW xmm5,dqword[ebx]
and no problem, it goes through. So yes, XP "blocking" it ... possible.

_________________
I am not young enough to know everything (Oscar Wilde)
Post 24 Sep 2010, 21:21
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji
I tested.
A loop of 32 moves,
but ... itself inside a loop of 100,000,000!
Just so the two can be compared with a stopwatch.
Of course ...
ecx with "stosd"
and ecx/4 with movntdq
Code:
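; --- rep stosd version ---
    mov  ecx,100000000              ; outer repeat count, for stopwatch timing
loop_test:
    push ecx

    xor  eax,eax
    mov  ecx,Dim_RAZ_dword          ; number of dwords to clear
    mov  edi,address_global_library
    rep  stosd

    pop  ecx
    loop loop_test

;------------------------------------
; --- movntdq version (separate test, same label name) ---
    mov  ecx,100000000
loop_test:
    push ecx

    pxor xmm0,xmm0
    mov  edi,address_global_library ; movntdq requires a 16-byte aligned address
    mov  ecx,Dim_RAZ_dword/4        ; 4 dwords per 16-byte store
@@: movntdq [edi],xmm0
    add  edi,16
    loop @B

    pop  ecx
    loop loop_test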
    

There is no contest.
"stosd" is much faster (even with the full ecx count).
The code with "stosd" is 4 to 5 times faster than the code with "movntdq".
All these instructions that use the mmx and xmm registers are super powerful, but they eat up a hell of a lot of clock cycles!

Post 24 Sep 2010, 22:07
edfed



Joined: 20 Feb 2006
Posts: 4242
Location: 2018
edfed
I think you're on the right track.
In any case, the best x86 code is the code that mainly uses the general-purpose registers. x86 machines are optimized for executing the general instructions (jcc, div, imul, ...), and more particularly the RISC-like instructions (add, neg, not, and, ...). Which is logical.

As for the SSE registers etc., the only place where they perform well is among themselves, when you execute only computation instructions involving two registers. If memory is involved, it's dead, I get the impression.

In 64-bit, on the other hand, I have no idea.
Post 24 Sep 2010, 22:38
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
FWIW, for me on Core2, test 2 (B) is fastest, and dec ecx / jnz is quicker than loop.

Code:
---------------------------      ---------------------------
Test A with LOOP                 Test B with LOOP
---------------------------      ---------------------------
Run 1 - 1152609                  Run 1 - 688117
Run 2 - 566806                   Run 2 - 283381
Run 3 - 566703                   Run 3 - 283364
Run 4 - 628949                   Run 4 - 283365
Run 5 - 566755                   Run 5 - 283365
Run 6 - 566695                   Run 6 - 283364
Run 7 - 566695                   Run 7 - 283364
Run 8 - 566695                   Run 8 - 284640
Run 9 - 589662                   Run 9 - 283365
Run 10- 566746                   Run 10- 283364

---------------------------      ---------------------------
Test A with DEC                  Test B with DEC
---------------------------      ---------------------------
Run 1 - 682456                   Run 1 - 616020
Run 2 - 272765                   Run 2 - 137827
Run 3 - 272714                   Run 3 - 137751
Run 4 - 272714                   Run 4 - 137776
Run 5 - 272714                   Run 5 - 137751
Run 6 - 272714                   Run 6 - 137751
Run 7 - 272722                   Run 7 - 137751
Run 8 - 272774                   Run 8 - 137751
Run 9 - 274890                   Run 9 - 137751
Run 10- 533222                   Run 10- 137751
    



Run under Vista
Code:
format PE GUI 4.0
entry start
include 'win32a.inc'

  Y       = 100000
  RunTest = 1
  UseLoop = 1

;===========================================================
section '.text' code readable executable
;-----------------------------------------------------------
  start:
            invoke  SetProcessAffinityMask,-1,1
            invoke  Sleep,10
            xor     ebx,ebx

  Again:
            cld
            mov     edi,Buff
            rdtsc
            mov     [StartEAX+ebx*4],eax
            mov     [StartEDX+ebx*4],edx

;------------------------
if RunTest = 1
            xor     eax,eax
            mov     ecx,Y
align 8
  A:
            stosd
      if UseLoop = 1
            loop    A
         else
            dec     ecx
            jnz     A
      end if

   else
;------------------------
            xor     eax,eax
            movd    mm1,eax
            mov     ecx,Y/2

align 8
B:          movq    [edi],mm1
            add     edi,8
      if UseLoop = 1
            loop    B
         else
            dec     ecx
            jnz     B
      end if
end if
;------------------------

            rdtsc
            mov     [EndEAX+ebx*4],eax
            mov     [EndEDX+ebx*4],edx
            inc     ebx
            cmp     ebx,10
            jb      Again

            xor     ebx,ebx
Next:
            mov     eax,[StartEAX+ebx*4]
            mov     edx,[StartEDX+ebx*4]
            sub     [EndEAX+ebx*4],eax
            sbb     [EndEDX+ebx*4],edx
            inc     ebx
            cmp     ebx,10
            jb      Next

            cinvoke wsprintf,Buff,wsformat,\
                    [EndEAX],[EndEDX],\
                    [EndEAX+4],[EndEDX+4],\
                    [EndEAX+8],[EndEDX+8],\
                    [EndEAX+0ch],[EndEDX+0ch],\
                    [EndEAX+10h],[EndEDX+10h],\
                    [EndEAX+14h],[EndEDX+14h],\
                    [EndEAX+18h],[EndEDX+18h],\
                    [EndEAX+1ch],[EndEDX+1ch],\
                    [EndEAX+20h],[EndEDX+20h],\
                    [EndEAX+24h],[EndEDX+24h]

            invoke  MessageBox,0,Buff,Tit,0
exit:
            invoke  ExitProcess,0

;===========================================================
section '.data' data readable writeable shareable
;-----------------------------------------------------------
  StartEAX              rd 10
  StartEDX              rd 10
  EndEAX                rd 10
  EndEDX                rd 10


  if RunTest = 1

     if UseLoop = 1
                        Tit  db 'Test A with LOOP',0
        else
                        Tit  db 'Test A with DEC',0
     end if

  else

     if UseLoop = 1
                        Tit  db 'Test B with LOOP',0
        else
                        Tit  db 'Test B with DEC',0
     end if
  end if

  wsformathex           db '%08X',0
  wsformat              db 'Run 1 - %I64u',13,10
                        db 'Run 2 - %I64u',13,10
                        db 'Run 3 - %I64u',13,10
                        db 'Run 4 - %I64u',13,10
                        db 'Run 5 - %I64u',13,10
                        db 'Run 6 - %I64u',13,10
                        db 'Run 7 - %I64u',13,10
                        db 'Run 8 - %I64u',13,10
                        db 'Run 9 - %I64u',13,10
                        db 'Run 10- %I64u',13,10
                        db 0
  wsformat2             db '%s',0
  Buff                  rd Y

;===========================================================
section '.idata' import data readable writeable
;-----------------------------------------------------------
     library kernel32,'KERNEL32.DLL',\
             user32,'USER32.DLL'

             include 'api/kernel32.inc'
             include 'api/user32.inc'

;===========================================================
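
The sub/sbb pair in the Next: loop above is a 64-bit subtraction carried out on the two 32-bit halves that rdtsc returns in EDX:EAX. A minimal C sketch of the same arithmetic follows; the function names (tsc_join, sub_sbb) are made up for illustration and are not part of the test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Join the two 32-bit halves that rdtsc leaves in EDX:EAX. */
static uint64_t tsc_join(uint32_t eax, uint32_t edx) {
    return ((uint64_t)edx << 32) | eax;
}

/* Emulate the sub/sbb pair: subtract the low halves first,
   then the high halves together with the borrow from the low half. */
static void sub_sbb(uint32_t *end_lo, uint32_t *end_hi,
                    uint32_t start_lo, uint32_t start_hi) {
    assert(end_lo != NULL && end_hi != NULL);
    uint32_t borrow = (*end_lo < start_lo) ? 1u : 0u; /* carry flag after sub */
    *end_lo -= start_lo;                              /* sub [EndEAX+ebx*4],eax */
    *end_hi -= start_hi + borrow;                     /* sbb [EndEDX+ebx*4],edx */
}
```

Doing the subtraction on the halves gives the same result as joining each pair into a 64-bit value first and subtracting those, which is why the program can pass the low and high dwords straight to %I64u.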
    
Post 29 Sep 2010, 15:27
ouadji



Joined: 24 Dec 2008
Posts: 1081
Location: Belgium
ouadji

Thank you Alphonso,

and with "rep stosd"?

Post 29 Sep 2010, 15:51
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
Code:
---------------------------
With rep stosd
---------------------------
Run 1 - 658470
Run 2 - 74239
Run 3 - 70856
Run 4 - 70873
Run 5 - 70873
Run 6 - 70882
Run 7 - 70881
Run 8 - 70864
Run 9 - 70873
Run 10- 70882    
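
Dividing a steady-state count by Y = 100000 dwords gives a rough figure of TSC ticks per stored dword for each variant. Below is a small C sketch of that arithmetic, using typical run values copied from the two result posts. The names are illustrative only, and TSC ticks need not equal core clock cycles.

```c
#include <assert.h>
#include <stdint.h>

#define Y_DWORDS 100000u  /* Y from the test program above */

/* Average TSC ticks spent per stored dword for one timed run. */
static double ticks_per_dword(uint64_t ticks, uint32_t dwords) {
    assert(dwords > 0);
    return (double)ticks / (double)dwords;
}

/* Typical steady-state runs copied from the result listings in this thread. */
static const uint64_t A_LOOP    = 566695;  /* Test A: stosd + loop         */
static const uint64_t B_LOOP    = 283364;  /* Test B: movq + loop          */
static const uint64_t A_DEC     = 272714;  /* Test A: stosd + dec/jnz      */
static const uint64_t B_DEC     = 137751;  /* Test B: movq + dec/jnz       */
static const uint64_t REP_STOSD = 70873;   /* rep stosd                    */
```

With these inputs the figures come out to roughly 5.7, 2.8, 2.7, 1.4 and 0.7 ticks per dword, so on this machine and with this in-cache buffer size, rep stosd is about twice as fast as the best hand-written loop.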
Post 30 Sep 2010, 05:00
bitRAKE



Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
Code:
  Y       = 1 shl 26


else if RunTest = 3 ;------------------------
    pxor mm0,mm0
    mov ecx,Y/2
@@: movntq [edi+ecx*8-8],mm0
    loop @B
end if ;-------------------------------------


  align 64
  Buff                  rd Y    
Move the buffer outside of the cache (Y = 1 shl 26 dwords is 256 MiB) and MOVNTQ beats REP STOSD.
(LOOP/DEC do not matter because the code is waiting on the bus.)
(Also, note that REP STOSB is not optimized within the CPU.)
Post 30 Sep 2010, 06:24
Alphonso



Joined: 16 Jan 2007
Posts: 294
Alphonso
bitRAKE wrote:

(LOOP/DEC do not matter because code is waiting on bus.)

The above code is still faster with dec for me, by ~30%. Xmm works well though, and in that case loop does not seem to affect the result with Y = 1 shl 26.

What about the CPU store buffer? I wonder how much of an effect that might have. And on a system with hyperthreading, would using the quicker dec instead of loop leave more time on the execution units for the other thread?
Post 30 Sep 2010, 12:24
bitshifter



Joined: 04 Dec 2007
Posts: 764
Location: Massachusetts, USA
bitshifter
rdtsc is OK for comparing code on one single PC
but will not produce the same results on other types of machine.
Also, it should be serialized first (e.g. cpuid with eax=0).
See Agner Fog's TSC testing utility.
You really should get Agner's PMC tester, which
measures micro-ops instead of clocks.
It must be calibrated to the specific chipset.
I don't know of a FASM port yet; maybe it's something
we can look forward to doing (as a community).
I know, nothing gets done as a community project around here,
but maybe we can forget our differences and build a real tool for everyone...
Post 30 Sep 2010, 13:06
bitRAKE



Joined: 21 Jul 2003
Posts: 3055
Location: vpcmipstrm
bitRAKE
My own opinion is that clearing a large section of memory is the wrong approach to the problem as it can be initialized at page granularity during first use. For example, I often use free lists for blocks of data because de-/allocation is trivial. It's better to pay the overhead of an exception handler on guard pages to grow memory commit size -- drawing the page into cache during initialization. There is a little more cache thrashing (when page granularity does not align with usage pattern), but no massive pause at start of application.
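
The free-list idea mentioned above can be sketched in C as a fixed-size block pool. This is only an illustration under assumed parameters: BLOCK_SIZE, BLOCK_COUNT and the pool_* names are hypothetical, and the guard-page commit growth described in the post is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE  64   /* assumed block size in bytes */
#define BLOCK_COUNT 8    /* assumed number of blocks in the pool */

/* A free block stores the link to the next free block in its own memory. */
typedef union block {
    union block *next;
    unsigned char payload[BLOCK_SIZE];
} block_t;

static block_t pool[BLOCK_COUNT];
static block_t *free_head = NULL;

/* Thread every block onto the free list once. */
static void pool_init(void) {
    free_head = NULL;
    for (size_t i = 0; i < BLOCK_COUNT; i++) {
        pool[i].next = free_head;
        free_head = &pool[i];
    }
}

/* Pop the head of the list; returns NULL when the pool is exhausted. */
static void *pool_alloc(void) {
    block_t *b = free_head;
    if (b != NULL)
        free_head = b->next;
    return b;
}

/* Push a block back onto the list; nothing is cleared here. */
static void pool_free(void *p) {
    assert(p != NULL);
    block_t *b = (block_t *)p;
    b->next = free_head;
    free_head = b;
}
```

Both pool_alloc and pool_free are a couple of pointer operations with no loop, which is the sense in which de-/allocation is trivial. Any clearing is left to whoever first uses a block.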
Post 01 Oct 2010, 01:59
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
