flat assembler
Message board for the users of flat assembler.

Index > Windows > Mandelbrot Benchmark FPU/SSE2 released

Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 09 Jan 2009, 12:30
Yes, I'm back, though it has been a long time.
My T9300 laptop has improved scores with the newest benchmark:
Code:
709.060 FPU
2 cores
Intel T9300 @ 2.5GHz

1678.300 SSE2
2 cores
Intel T9300 @ 2.5GHz
    

I just have to get my home i7 920 to work and test your bench there!
dacid



Joined: 31 Aug 2008
Posts: 57
dacid 09 Jan 2009, 14:22
I recently changed my 4200+ to a 6000+

AMD Athlon(tm) 64 X2 Dual Core Processor 6000+

KMB_V0.53I-32b-MT_SSE2
Speed [Million Iterations / Second] : 1018.479
Logical CPU cores detected : 2

KMB_V0.53I-32b-MT_FPU
Speed [Million Iterations / Second] : 696.017
Logical CPU cores detected : 2
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 18 Feb 2009, 16:06
I finally got my i7 working again:
* MoBo replaced DX58SO (I hope I didn't kill the previous one)
* RAM changed (I hope my previous 2.1V RAM isn't to be blamed) to 8-8-8-24-2T Corsair 1600MHz 1.65V
* case changed so GTX 260 can fit in

Now the tests:
Code:
Server 2008 Enterprise (x64) ~ basically a Vista 64
6GB triple-DDR3 @ 1600MHz

1596,746 FPU
4 cores / 8 threads
Intel Core i7 920 @ 2.66GHz

4524,112 SSE2
4 cores / 8 threads
Intel Core i7 920 @ 2.66GHz
    


They say everywhere that when you go beyond 1.65V you might hurt the CPU, but actually it feels like I've hurt the MoBo because the old DX58SO doesn't boot anymore (spins the fans, but no video). It doesn't even beep errors when no vid or no RAM...strange. New one is working fine though.
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 20 Feb 2009, 18:03
Hi Madis731,

got mine also running now, though there were some issues with my Asus P6T motherboard as well...these expensive i7 motherboards are still a little beta-style I think, not the quality I used to know from Asus.

The behaviour of HT is interesting: the FPU loses a bit, while SSE2 gains a lot:

Intel Core i7 920 (HT on) at 3200 MHz
FPU: 1820,197 MIter/s (Efficiency: 142,2) / SSE2: 5138,498 MIter/s (Efficiency: 401,4)
Intel Core i7 920 (HT off) at 3200 MHz
FPU: 1869,244 MIter/s (Efficiency: 146,0) / SSE2: 4573,828 MIter/s (Efficiency: 357,3)

I'll start soon checking out if SSE4 can give some benefit.
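
The efficiency figures above are consistent with MIter/s divided by the clock in GHz and by the number of physical cores. That formula is inferred from the posted numbers, not stated in the thread, so treat the sketch below as an assumption; the function names are made up for illustration. It also works out how much HT changes each result.

```python
def efficiency(miter_per_s, clock_mhz, physical_cores):
    """MIter/s per GHz per physical core (formula inferred from the posted figures)."""
    return miter_per_s / (clock_mhz / 1000.0) / physical_cores


def percent_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new / old - 1.0) * 100.0


# Figures from the post: Core i7 920 at 3200 MHz, 4 physical cores.
runs = {
    "HT on": {"FPU": 1820.197, "SSE2": 5138.498},
    "HT off": {"FPU": 1869.244, "SSE2": 4573.828},
}

for ht, scores in runs.items():
    for unit, score in scores.items():
        print(f"{ht:6} {unit:4} efficiency = {efficiency(score, 3200, 4):.1f}")

for unit in ("FPU", "SSE2"):
    change = percent_change(runs["HT on"][unit], runs["HT off"][unit])
    print(f"{unit}: HT on vs. HT off = {change:+.1f}%")
```

With these inputs the FPU result drops by roughly 2.6% with HT on, while the SSE2 result rises by roughly 12.3%.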
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 07 Jun 2009, 15:16
Hi guys,

after having some trouble implementing the 'PTEST' instruction and happily finding a solution in the forum, I released an SSE4.1 version of my old code. You'll find it at http://www.mikusite.de/x86/KMB_V0.53I-32b-MT.zip

It delivers a performance gain of about 2% on my i7:

Intel Core i7 920 (HT on) at 3200 MHz
FPU: 1820,197 MIter/s (Efficiency: 142,2) / SSE2: 5224,498 MIter/s (Efficiency: 408,2)

Does anyone have an SSE4.1-capable Intel C2D or C2Q to confirm the gain on these CPUs?

I guess the next big speed-up will have to wait until we have these SSE5/AVX (or whatever it's called) CPUs with ymm's (256-bit registers) and MULADD/MULSUB in 2010 or 2011 ;)
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 07 Jun 2009, 15:33
On X58 Intel board
BIOS Version/Date: SOX5810J.86A.4014.2009.0507.1930, 7.05.2009
Code:
i7 920 {HT:Enabled; Turbo:Enabled}

FPU:  1596,746
SSE2:   4524,112
SSE4.1: 4573,828
    

Not much, but at least something :)
bitRAKE



Joined: 21 Jul 2003
Posts: 4020
Location: vpcmpistri
bitRAKE 07 Jun 2009, 16:05
Code:
Kümmel Mandelbrot Benchmark V 0.53I-32b-MT_FPU
Speed [Million Iterations / Second] : 2708.579   

Kümmel Mandelbrot Benchmark V 0.53I-32b-MT_SSE2
Speed [Million Iterations / Second] : 6274.648   

Kümmel Mandelbrot Benchmark V 0.53I-32b-MT_SSE4.1
Speed [Million Iterations / Second] : 6091.000   

Logical CPU cores detected : 8  
CPU Brand detected : Intel(R) Xeon(R) CPU           L5410  @ 2.33GHz    
The SSE4.1 code is a little slower on my machine.

No plans for a 64-bit version, huh?

(Someone really has two L5450's? I couldn't find them anywhere except in your results listing.)
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 07 Jun 2009, 17:42
bitRAKE wrote:
The SSE4.1 code is a little slower on my machine.

No plans for a 64-bit version, huh?

(Someone really has two L5450's? I couldn't find them anywhere except in your results listing.)

Hm, so it's slower there... maybe hyper-threading can make better use of it... hm... it seems to be getting hard to optimize even within the Intel universe... I don't have a clue...

Regarding the 64-Bit there's still the one from Xorpd! ->
http://xorpd.home.comcast.net/~xorpd/
I could also try to implement the 'ptest' there if he's not active any more. But I'm having a hard time building it... he refers to some old SDK that is needed, which I couldn't find any more... any help, or could anybody send me the setup with the needed files so that I can assemble and link it?

Regarding the 5450's, no idea...I think somebody on http://forums.2cpu.com got it...
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 08 Jun 2009, 07:04
C2D T9300 @2.5GHz
Code:
FPU:  709.060
SSE2:   1673.800
SSE4.1: 1623.739
    

Oh no, slower :P
I think PTEST is not your savior here because it's not your average 1-uop instruction; it actually takes 2 uops (fused). PCMPxxx is better because it's a 1-uop instruction and, furthermore, its throughput is one instruction every 0.5 clocks (for PCMPxxx xmm,xmm).
If you are worried about the ALU<=>FPU conversions, then CMPxxPD is the better bet, but its throughput is the usual 1 and it also takes 1 uop.

_________________
My updated idol :D http://www.agner.org/optimize/
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 08 Jun 2009, 17:36
...yes, I knew that, I was just hoping, as in theory I saved a CMP in the inner loop ;) ...because I had a sequence like
CMPLEPD
MOVMSKPD
CMP
JNE
...replaced by
SUBPD
PTEST
JNC

...but to be able to do that I needed to reorder and rename some registers and instructions, which the C2D doesn't seem to like... okay, at least Hyper-Threading seems to be able to make some use of it, I guess, as it fills the gaps in the instruction streams...
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 13 Oct 2009, 06:50
hmm... any new development?
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 14 Oct 2009, 07:16
I was just looking over the code for "ALGO=1" and it seems that it contains a few inefficiencies. To recall, we are doing
T=X^2-Y^2+x;
Y=2*X*Y+y;
X=T;
where (x,y) is the point and (X,Y) is the sequence of iterations. I would think this can be done more efficiently as:
Code:
suppose
xmm1 contains X
xmm3 contains Y
        movaps  xmm7,xmm2      ; m7 = X
        mulpd   xmm2,xmm2      ; X = X^2
        movaps  xmm6,xmm1      ; m6 = Y
        mulpd   xmm6,xmm1      ; m6 = Y^2
        addpd   xmm1,xmm1      ; Y = Y + Y
        subpd   xmm2,xmm6      ; X = X - Y^2
        addpd   xmm2,dqword[x] ; X = X + x
        mulpd   xmm1,xmm7      ; Y = Y*X
        addpd   xmm1,dqword[y] ; Y = Y + y

Notice that this contains 2 fewer mov's and 1 fewer mul, no two instructions of the same type appear consecutively, and it trashes only two other registers, so it could be interlaced with another set of pixels to reduce the data dependencies. If you do make these changes, could you let me know how it affects the performance?
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 14 Oct 2009, 07:18
oops, I clearly meant
xmm2 contains X
xmm1 contains Y
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 14 Oct 2009, 09:19
I'm not quite sure what is happening in the FPU section, but it looks like it can be improved similarly. It looks like you might have them interlaced, but do the xchg's actually help the FPU?
I haven't found a way to eliminate the xchg when checking for divergence, i.e.
Code:
        fld     [four];4 
        fld     [x]     ;4,x 
        fld     [y]     ;4,x,y 
        fld     st1     ;4,x,y,X 
        fld     st1     ;4,x,y,X     Y 
again: 
        fld     st1     ;4,x,y,X     Y       X 
        fmul    st0,st2 ;4,x,y,X     Y       X*X 
        fld     st1     ;4,x,y,X     Y       X*X     Y 
        fmul    st0,st2 ;4,x,y,X     Y       X*X     Y*Y 
        fld     st0     ;4,x,y,X     Y       X*X     Y*Y     Y*Y 
        fadd    st0,st2 ;4,x,y,X     Y       X*X     Y*Y     X*X+Y*Y 
        fcomip  st7     ;4,x,y,X     Y       X*X     Y*Y 
        ja      done 
        fsubp   st1,st0 ;4,x,y,X     Y       X*X-Y*Y 
        fadd    st0,st4 ;4,x,y,X     Y       X*X-Y*Y+x 
        fxch    st2     ;4,x,y,newX  Y       X 
        fmulp   st1,st0 ;4,x,y,newX  X*Y 
        fadd    st0,st0 ;4,x,y,newX  2*X*Y 
        fadd    st0,st2 ;4,x,y,newX  newY 
        jmp     again    

but the pure iteration can be done with only two moves (like sse):
Code:
   fld   st1     ;4,x,y,X     Y  X
   fmul  st2,st0 ;4,x,y,X*X   Y  X
   fld     st1     ;4,x,y,X*X   Y  X  Y
   fmul  st0,st2 ;4,x,y,X*X   Y  X  Y*Y
   fsub  st0,st5 ;4,x,y,X*X   Y  X  Y*Y-x
   fsubp st3,st0 ;4,x,y,newX  Y  X
   fmulp st1,st0 ;4,x,y,newX  X*Y
   fadd  st0,st0 ;4,x,y,newX  2*X*Y
   fadd  st0,st2 ;4,x,y,newX  newY    
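
One way to check the stack bookkeeping of the second sequence is to run it through a toy model of the x87 register stack. The sketch below is Python rather than assembly, and the `X87` class and its method names are made up for illustration. It assumes Intel-syntax operand order, where `fsubp st3,st0` stores st3 - st0 into st3 and then pops.

```python
class X87:
    """Toy x87 register stack; st(0) is the last element of the list."""

    def __init__(self):
        self.s = []

    def st(self, i):
        return self.s[-1 - i]

    def put(self, i, value):
        self.s[-1 - i] = value

    def fld(self, value):          # fld [mem]
        self.s.append(value)

    def fld_st(self, i):           # fld st(i)
        self.s.append(self.st(i))

    def fmul(self, dst, src):      # fmul st(dst),st(src)
        self.put(dst, self.st(dst) * self.st(src))

    def fadd(self, dst, src):      # fadd st(dst),st(src)
        self.put(dst, self.st(dst) + self.st(src))

    def fsub(self, dst, src):      # fsub st(dst),st(src)
        self.put(dst, self.st(dst) - self.st(src))

    def fsubp(self, dst):          # fsubp st(dst),st0 then pop
        self.put(dst, self.st(dst) - self.st(0))
        self.s.pop()

    def fmulp(self, dst):          # fmulp st(dst),st0 then pop
        self.put(dst, self.st(dst) * self.st(0))
        self.s.pop()


def pure_iteration(fpu):
    """The nine-instruction sequence from the post; stack is 4,x,y,X,Y on entry."""
    fpu.fld_st(1)      # 4,x,y,X,Y,X
    fpu.fmul(2, 0)     # 4,x,y,X*X,Y,X
    fpu.fld_st(1)      # 4,x,y,X*X,Y,X,Y
    fpu.fmul(0, 2)     # 4,x,y,X*X,Y,X,Y*Y
    fpu.fsub(0, 5)     # 4,x,y,X*X,Y,X,Y*Y-x
    fpu.fsubp(3)       # 4,x,y,newX,Y,X
    fpu.fmulp(1)       # 4,x,y,newX,X*Y
    fpu.fadd(0, 0)     # 4,x,y,newX,2*X*Y
    fpu.fadd(0, 2)     # 4,x,y,newX,newY


def run(x, y, steps):
    """Load 4, x, y, X=x, Y=y as in the first listing, then iterate."""
    fpu = X87()
    for value in (4.0, x, y):
        fpu.fld(value)
    fpu.fld_st(1)      # X = x
    fpu.fld_st(1)      # Y = y
    for _ in range(steps):
        pure_iteration(fpu)
    return fpu.st(1), fpu.st(0), len(fpu.s)


X, Y, depth = run(-0.5, 0.25, 3)
print(X, Y, depth)
```

The result can be compared against `z = z*z + c` with Python's built-in complex type; the stack depth stays at five entries after every pass, which is what matters when several points share the eight x87 registers.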
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 14 Oct 2009, 09:25
Oh, never mind - the first code can be written like the second to eliminate the exchange. It can be done.
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 14 Oct 2009, 17:52
tthsqe wrote:
I was just looking over the code for "ALGO=1" and it seems that it contains a few inefficiencies. To recall, we are doing
T=X^2-Y^2+x;
Y=2*X*Y+y;
X=T;
where (x,y) is the point and (X,Y) is the sequence of iterations. I would think this can be done more efficiently as:

Hi tthsqe,

I had a short look at your SSE2 code and tried to fit it to my first sequence in the main loop, see below. I think the major part missing in your code is the calculation of X^2 + Y^2 and saving it (boundary check). I have to do that to be really 'correct' about when the iteration should end in the unrolled loop. So far I haven't found a convenient way to implement it in your code sequence... as it 'needs' xmm1 and xmm0 'later' than mine... any suggestions?

Regarding the FPU, you have to pay attention to the stack size. As I'm always iterating 3 points, there's limited space. Look at '.continue_main_loop_entry_1:' to see the stack overview in the main loop; I think yours uses too much of the stack to keep the rest of the points saved.
Code:
.continue_main_loop_entry_12:

kuemmel:

        movaps  xmm3, [.two_two]        ;    2.0    |    2.0      ; 1st iter point 1 and 2
        mulpd   xmm3, xmm1              ;   2*iz    |   2*iz
        mulpd   xmm1, xmm1              ;   iz^2    |   iz^2
        mulpd   xmm3, xmm0              ;  2*rz*iz  |  2*rz*iz
        mulpd   xmm0, xmm0              ;   rz^2    |   rz^2
        movaps  xmm2, xmm0              ;   rz^2    |   rz^2
        subpd   xmm0, xmm1              ; rz^2-iz^2 | rz^2-iz^2
        addpd   xmm1, xmm2              ; rz^2+iz^2 | rz^2+iz^2
        addpd   xmm0, [.local_rz0_12]
        movaps  [.state_of_12_a], xmm1
        movaps  xmm1, [.local_iz0]
        addpd   xmm1, xmm3

tthsqe:

        movaps  xmm3, xmm0              ;  rz             rz
        mulpd   xmm0, xmm0              ;  rz^2           rz^2
        movaps  xmm2, xmm1              ;  iz             iz
        mulpd   xmm2, xmm1              ;  iz^2           iz^2
        addpd   xmm1, xmm1              ;  2*iz           2*iz
        subpd   xmm0, xmm2              ;  rz^2 - iz^2    rz^2 - iz^2
        addpd   xmm0, [.local_rz0_12]   ;  new rz + rd
        mulpd   xmm1, xmm3              ;  2*iz*rz        2*iz*rz
        addpd   xmm1, [.local_iz0]      ;  new iz + id
    
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 14 Oct 2009, 20:31
Oh, I failed to realize that you were calculating rz^2 + iz^2 at every step. If you want to calculate EXACTLY when the iteration gets out of the circle then it looks like you need 12 instructions per iteration. Are you really against calculating rz^2 + iz^2 only every, let's say, 4th iteration?
I'll see if the initial multiply by 2.0 can be replaced with an addition.
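
To make the trade-off concrete, here is a rough Python sketch (not the benchmark's code) of what testing rz^2 + iz^2 only every 4th iteration does to the reported iteration count. The function name and the `check_every` parameter are made up for illustration, and the iteration starts from z = c, as in the FPU listing earlier in the thread.

```python
def escape_iter(cx, cy, max_iter=1000, check_every=1):
    """Return the iteration at which rz^2 + iz^2 > 4 is first noticed."""
    rz, iz = cx, cy
    for n in range(1, max_iter + 1):
        rz, iz = rz * rz - iz * iz + cx, 2.0 * rz * iz + cy
        # The bail-out test only runs on every check_every-th iteration.
        if n % check_every == 0 and rz * rz + iz * iz > 4.0:
            return n
    return max_iter


exact = escape_iter(1.0, 0.0)                 # test on every iteration
late = escape_iter(1.0, 0.0, check_every=4)   # test on every 4th iteration
print(exact, late)
```

For c = 1 + 0i the exact test notices the escape at iteration 2, while the every-4th test only notices it at iteration 4. The point still counts as escaped, but the stored count is rounded up to a multiple of 4, which shifts the colouring and means the result no longer matches the exact version.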
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 14 Oct 2009, 21:11
OK, it can be done in 11 instructions, and the pure iteration can be done with one less mov and one less register than I originally had.

Code:
with X^2+Y^2:
    0     1     2   3 
0:  X     Y     ?   ? 
1:  X     Y     Y   ?       mov 2,1
2:  X     XY    Y   ?       mul 1,0
3:  X     2XY   Y   ?       add 1,1
4:  X     2XY   YY  ?       mul 2,2
5:  X     newY  YY  ?       add 1,[y]
6:  XX    newY  YY  ?       mul 0,0
7:  XX    newY  YY  XX      mov 3,0
8:  XX-YY newY  YY  XX      sub 0,2
9:  newX  newY  YY  XX      add 0,[x]
10: newX  newY  YY  XX+YY   add 3,2
11: newX  newY  YY  XX+YY   mov [state],3

without:
    0     1     2    
0:  X     Y     ?       
1:  X     Y     Y      mov 2,1
2:  X     XY    Y      mul 1,0
3:  X     2XY   Y      add 1,1
4:  X     2XY   YY     mul 2,2
5:  X     newY  YY     add 1,[y]
6:  XX    newY  YY     mul 0,0
7:  XX-YY newY  YY     sub 0,2
8:  newX  newY  YY     add 0,[x]    


I doubt if anything smaller is possible.
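
The two schedules above can be checked mechanically with a tiny interpreter for two-operand instructions. This is only a Python sketch: the tuple encoding and the `run_schedule` helper are invented for illustration, registers 0 to 3 stand in for xmm registers holding a single value each, and `"x"`, `"y"` and `"state"` stand in for the memory operands.

```python
def run_schedule(ops, X, Y, x, y):
    """Execute (op, dst, src) tuples; integers are registers, strings are memory."""
    reg = [X, Y, None, None]
    mem = {"x": x, "y": y, "state": None}
    for op, dst, src in ops:
        value = mem[src] if isinstance(src, str) else reg[src]
        if op == "mov":
            if isinstance(dst, str):
                mem[dst] = value
            else:
                reg[dst] = value
        elif op == "mul":
            reg[dst] *= value
        elif op == "add":
            reg[dst] += value
        elif op == "sub":
            reg[dst] -= value
    return reg, mem


# 11 instructions, keeps X^2+Y^2 for the boundary check.
WITH_CHECK = [
    ("mov", 2, 1), ("mul", 1, 0), ("add", 1, 1), ("mul", 2, 2),
    ("add", 1, "y"), ("mul", 0, 0), ("mov", 3, 0), ("sub", 0, 2),
    ("add", 0, "x"), ("add", 3, 2), ("mov", "state", 3),
]

# 8 instructions, pure iteration without the check.
WITHOUT_CHECK = [
    ("mov", 2, 1), ("mul", 1, 0), ("add", 1, 1), ("mul", 2, 2),
    ("add", 1, "y"), ("mul", 0, 0), ("sub", 0, 2), ("add", 0, "x"),
]

reg, mem = run_schedule(WITH_CHECK, 1.5, -0.5, 0.25, 0.125)
print(reg[0], reg[1], mem["state"])
```

Note that `state` ends up holding X^2+Y^2 of the point going into the step, not of the freshly computed (newX, newY).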
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 15 Oct 2009, 19:46
Hi tthsqe,

okay, maybe for the Mandelbrot pictures it is not that important to check the iteration end exactly for each pixel, but mathematically, and to keep the 'rules' for optimizing, I kept it like that so I can really compare against my old code.

I implemented your idea now and it seems to be about 10% slower on my i7 @ 3200 MHz. I think maybe Xorpd! (who made the 64-bit version) could explain it a little better, but it has a lot to do with dependencies and register usage.

Now with your code the auxiliary xmm2 and xmm3 are used much more, so the benefit of different instruction lines (like the 3 pixel iteration blocks) goes down. And so minimizing instruction count isn't always that useful, especially not on the Core 2 Duo architecture; also, replacing MULs with ADDs might not end in a better result. For me it's often trial and error; some guys may know it right away...

Find the file with your changes attached. Maybe another idea?


Attachment: KMB_test_tthsqe.zip (30.91 KB, downloaded 244 times)

Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 15 Oct 2009, 19:57
kalambong wrote:
hmm... any new development?

Hi Kalambong,

unfortunately not, I ran out of ideas for how to optimize further ;) The only thing left would be to implement the backtrace method of Xorpd! from the 64-bit version here, but I'm too lazy and it doesn't gain that much.

I'm rather looking forward to the next-generation architectures of Intel and AMD, especially with SSE5 or AVX or whatever it's called. The bus of the SSE unit is also wider there, so I expect a benefit for my existing SSE2 code too... but it seems that means waiting until 2010/2011 for that stuff to be out...

Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.