flat assembler
Message board for the users of flat assembler.

curious FLOPS issue

tthsqe 17 Mar 2010, 23:56
While testing the claim of 4 DP FLOPs per clock, I stumbled upon something interesting:

Code:
format PE64 GUI
entry start

include 'win64a.inc'

section '.text' code readable executable

start:
        rdtsc
        mov     dword[Cycles],eax
        mov     dword[Cycles+4],edx

        movaps  xmm0,dqword[r]
        movaps  xmm1,dqword[rinv]
      irps i, 2 3 4 5 6 7 8 9 10 11 12 13 14 15 {
        movaps xmm#i,dqword[r]  }


        mov     eax,1000000
        mov     ecx,0
align 16
    @@:
     rept 4 {
        addpd   xmm2,xmm0
        mulpd   xmm3,xmm0
        addpd   xmm4,xmm0
        mulpd   xmm5,xmm0
        addpd   xmm6,xmm0
        mulpd   xmm7,xmm0
        addpd   xmm8,xmm0
        mulpd   xmm9,xmm0
        addpd   xmm10,xmm0
        mulpd   xmm11,xmm0
   ;     addpd   xmm12,xmm0
   ;     mulpd   xmm13,xmm0            ; comment out lines
   ;     addpd   xmm14,xmm0           ; here
   ;     mulpd   xmm15,xmm0

        subpd   xmm2,xmm0
        mulpd   xmm3,xmm1
        addpd   xmm4,xmm0
        mulpd   xmm5,xmm1
        addpd   xmm6,xmm0
        mulpd   xmm7,xmm1
        addpd   xmm8,xmm0
        mulpd   xmm9,xmm1
        addpd   xmm10,xmm0
        mulpd   xmm11,xmm1
   ;     addpd   xmm12,xmm0
   ;     mulpd   xmm13,xmm1
   ;     addpd   xmm14,xmm0           ; and here
   ;     mulpd   xmm15,xmm1

     }
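        add     rcx,2*4*4*5            ; FLOPs per iteration: 2 doubles * 4 (rept) * 4 instructions per pair * 5 pairs
                                       ; change the last factor to match the number of active add/mul pairs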
        sub     eax,1
        jnz     @b

        rdtsc
        shl     rdx,32
        or      rdx,rax
        sub     [Cycles],rdx
        mov     qword[FLOPs],rcx
        fild    qword[FLOPs]
        fild    qword[Cycles]
        fdivp   st1,st0
        fchs
        fstp    qword[FLOPs]

        invoke  sprintf,Message,MessageFormat,qword[FLOPs]
        invoke  MessageBoxA,0,Message,Caption,MB_OK
        invoke  ExitProcess,0

section '.data' data readable writeable

        r       dq 1.1,1.1
        rinv    dq 0.9090909090909090909,0.9090909090909090909

        Cycles  dq 0
        FLOPs   dq 0

  MessageFormat db 'FLOPs per cycle: %f',0
  Caption  db  'Solution:',0
  Message  db  64 dup (0)       ; output buffer for sprintf, large enough for the formatted text

section '.idata' import data readable writeable

  library kernel32,'KERNEL32.DLL',\
          user32,'USER32.DLL',\
          msvcrt,'MSVCRT.DLL'

  include 'api\kernel32.inc'
  include 'api\user32.inc'


  import msvcrt,\
         sprintf,'sprintf'     


Here are the results (FLOPs per cycle) for different numbers of add/mul pairs in the loop
(so 6 would correspond to deleting the lines containing xmm14 and xmm15):

3: 2.37
4: 3.18
5: 3.97
6: 1.95
7: 1.96

The good news is that we can get close to 4. The bad news is that adding a sixth pair kills it!?

I know the rdtsc is not done correctly, but the loop in between is big enough to send the % error close to zero.
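
A side note on the bookkeeping, for anyone changing the number of pairs: the constant added to rcx each iteration follows directly from the loop layout. The sketch below is Python rather than fasm, and its function and constant names are made up for illustration; it only reproduces the arithmetic behind 2*4*4*pairs and the division done with fild/fdivp.

```python
# Illustrative only: mirrors the FLOP bookkeeping of the loop above.
# All names here are invented for this sketch, not taken from the thread.

DOUBLES_PER_PACKED_OP = 2   # addpd/mulpd each operate on two doubles
REPT_COUNT = 4              # "rept 4" unrolls the body four times
INSTRS_PER_PAIR = 4         # one add + one mul in each of the two half-blocks


def flops_per_iteration(pairs):
    """FLOPs executed by one pass through the unrolled loop body."""
    return DOUBLES_PER_PACKED_OP * REPT_COUNT * INSTRS_PER_PAIR * pairs


def flops_per_cycle(pairs, iterations, cycles):
    """Total FLOPs divided by the elapsed TSC ticks."""
    return flops_per_iteration(pairs) * iterations / cycles


for pairs in (3, 4, 5, 6, 7):
    print(pairs, "pairs ->", flops_per_iteration(pairs), "FLOPs per iteration")
```

With 5 active pairs this gives 160, the same value as 2*4*4*5 in the listing.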

revolution 18 Mar 2010, 06:27
Which CPU? Which OS?

baldr 18 Mar 2010, 13:57
tthsqe,

I think performance tests should be done in an OS-independent way, from a special boot sector for example. The TSC is OK, but PMCs can give some insight into code execution. The only question is, who will bother to write such a framework? ;)

Isn't rept 15-2+1 i:2 { movaps xmm#i, dqword[r] } more readable than irps i, 2 3 4 5 6 7 8 9 10 11 12 13 14 15 { movaps xmm#i,dqword[r] }?

tthsqe 19 Mar 2010, 05:51
Core 2 Quad Q8200 2.3 GHz, 64bit Vista.
The CPU seems to be choking. Maybe there is such a thing as too much ILP?
It would be interesting to see if the i7 has the same issue ...

revolution 19 Mar 2010, 06:08
tthsqe: Speed optimisation is really hard. Seriously, it is no simple matter. The internals of the CPUs are very complex. It is unlikely that the effect you see is simply a matter of "too much ILP". There are many mechanisms inside the CPU that can do things with small loops to make them use less bandwidth in the decoder and the cache, thus allowing more bandwidth for other things. Once your loop becomes larger than a certain size, those mechanisms no longer work and other factors like cache efficiency and decoder speed become more prominent.

This is why, when optimising any code, you have to say for which CPU, which OS, what data ranges and so on. Other things also matter, such as SDRAM timings and number of channels, clock multipliers, heatsink efficiency, other processes/tasks running, etc.

My guess would be that you have hit the decoder limit for SSE instructions, probably close to one instruction per cycle for code that is in L1 cache and is larger than the decoder prefetch queue.

tthsqe 19 Mar 2010, 06:37
I don't see how a decoder limit would cause the steady increase until 5 and then the sudden drop-off.
The code is virtually identical; the only thing varying is the number of independent tasks per block.
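
One possible reading of the rising part (3 to 5 pairs), offered as an assumption rather than a diagnosis: each register is updated once per half-block, so a half-block cannot finish faster than the latency of the slowest instruction in a dependency chain. The Python sketch below assumes a mulpd latency of 5 cycles and a peak of one addpd plus one mulpd issued per cycle. Both figures are commonly quoted for Core 2 but are not taken from this thread, and the function name is made up. The model only covers the climb; it says nothing about the drop at 6 and 7 pairs.

```python
# Hypothetical latency-bound model; all constants are assumptions, not measurements.

MUL_LATENCY = 5     # assumed mulpd latency in cycles
PEAK_FLOPS = 4.0    # assumed peak: 1 addpd + 1 mulpd per cycle, 2 doubles each

# Measured values as posted at the top of the thread.
measured = {3: 2.37, 4: 3.18, 5: 3.97, 6: 1.95, 7: 1.96}


def latency_bound_flops(pairs, latency=MUL_LATENCY, peak=PEAK_FLOPS):
    """FLOPs per cycle if only chain latency and issue width limit the loop."""
    # A half-block holds `pairs` adds and `pairs` muls, i.e. 4*pairs FLOPs.
    # It needs at least `pairs` cycles to issue and at least `latency` cycles
    # before the next dependent mulpd on the same register can start.
    cycles = max(pairs, latency)
    return min(peak, 4.0 * pairs / cycles)


for pairs in sorted(measured):
    print(f"{pairs}: model {latency_bound_flops(pairs):.2f}, "
          f"measured {measured[pairs]:.2f}")
```

Under these assumptions the model lands within a few hundredths of the measured values for 3, 4 and 5 pairs, and stays at the assumed peak for 6 and 7, where the measurements fall to about 2.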

revolution 19 Mar 2010, 06:40
Well, the decoder is not as simple as it might first appear. It has a lot of internal optimisations that can work with small loops efficiently. Larger loops drop back to the default throughput.

hopcode 19 Mar 2010, 10:56
tthsqe wrote:
Core 2 Quad Q8200 2.3 GHz, 64bit Vista.

~ Like me
Code:
    Number of cores         4 (max 4)
    Number of threads       4 (max 4)
    Name                    Intel Core 2 Quad Q8300
    Codename                Yorkfield
    Specification           Intel(R) Core(TM)2 Quad CPU    Q8300  @ 2.50GHz
    Package (platform ID)   Socket 775 LGA (0x4)
    CPUID                   6.7.A
    Extended CPUID          6.17
    Core Stepping           R0
    Technology              45 nm
    Core Speed              2000.0 MHz
    Multiplier x FSB        6.0 x 333.3 MHz
    Rated Bus speed         1333.4 MHz
    Stock frequency         2500 MHz
    Instructions sets       MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, EM64T
    L1 Data cache           4 x 32 KBytes, 8-way set associative, 64-byte line size
    L1 Instruction cache    4 x 32 KBytes, 8-way set associative, 64-byte line size
    L2 cache                2 x 2048 KBytes, 8-way set associative, 64-byte line size
    FID/VID Control         yes
    FID range               6.0x - 7.5x
    Max VID                 1.250 V
    

It is important not to have a shared cache.
My results are similar to yours, but the rdtsc in your code is very messy.
For example, to get a delta variance of only ~100% of the value, as in the following results
Code:
3 ---- 0.76
4 ---- 1.03
5 ---- 1.29
6 ---- 1.52
7 ---- 1.78
    

put this instruction here
Code:
   @@:
     rept 4 {
     emms ;<--------- this instruction
     addpd   xmm2,xmm0
;    ....  rest
    

I have a recipe, but I will not publish it until I have found improvements on it.

Cheers, :D
hopcode
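
A caution when comparing numbers between machines, stated as an assumption: on Core 2 class CPUs the TSC is generally described as ticking at a constant rate tied to the rated frequency, not at the current core clock. If a core really ran at the 6.0x multiplier shown above during the measurement (which the listing alone does not prove), "FLOPs per TSC tick" would understate "FLOPs per core cycle". The Python sketch below shows the conversion; the function name is invented, and the 1.29 input is only used as a sample value from the table above.

```python
# Hypothetical conversion; assumes a constant-rate TSC at the stock frequency.

def flops_per_core_cycle(flops_per_tsc_tick, tsc_mhz, core_mhz):
    """Rescale a per-TSC-tick figure to a per-core-cycle figure."""
    return flops_per_tsc_tick * tsc_mhz / core_mhz


# Sample: 1.29 FLOPs per TSC tick, TSC at 2500 MHz, core at 2000 MHz.
print(flops_per_core_cycle(1.29, 2500.0, 2000.0))
```

If the core was at full speed during the run, both frequencies are equal and the figure is unchanged.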

Madis731 19 Mar 2010, 14:38
What should the bootable framework look like?

For example, you could just run your code and print out the results in Real Mode, couldn't you?

The advanced variant of this bootable framework is of course some shell where you can write fancy things like "run", "help" and "bench test_3a -threads=3" :P
Copyright © 1999-2024, Tomasz Grysztar.