flat assembler
Message board for the users of flat assembler.
Mandelbrot Benchmark FPU/SSE2 released

dacid
I recently changed my 4200+ to a 6000+:
AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
KMB_V0.53I32bMT_SSE2
Speed [Million Iterations / Second] : 1018.479
Logical CPU cores detected : 2
KMB_V0.53I32bMT_FPU
Speed [Million Iterations / Second] : 696.017
Logical CPU cores detected : 2

09 Jan 2009, 14:22 

Madis731
I finally got my i7 working again:
* MoBo replaced: DX58SO (I hope I didn't kill the previous one)
* RAM changed (I hope my previous 2.1V RAM isn't to blame) to 8-8-8-24-2T Corsair 1600MHz 1.65V
* case changed so the GTX 260 can fit in

Now the tests:
Code:
Server 2008 Enterprise (x64) ~ basically a Vista 64
6GB tripleDDR3 @ 1600MHz
1596,746 FPU 4 cores / 8 threads Intel Core i7 920 @ 2.66GHz
4524,112 4 cores / 8 threads Intel Core i7 920 @ 2.66GHz

They say everywhere that when you go beyond 1.65V you might hurt the CPU, but it actually feels like I've hurt the MoBo, because the old DX58SO doesn't boot anymore (it spins the fans, but there is no video). It doesn't even beep errors when there is no video card or no RAM... strange. The new one is working fine though.

18 Feb 2009, 16:06 

Kuemmel
Hi Madis731,
I've got mine running now as well, though there were some issues with my Asus P6T motherboard too... these expensive i7 motherboards are still a little beta-style I think, not the quality I used to know from Asus. The behaviour of HT is interesting: the FPU loses a bit, the SSE2 wins a lot:

Intel Core i7 920 (HT on) at 3200 MHz
FPU: 1820,197 MIter/s (Efficiency: 142,2) / SSE2: 5138,498 MIter/s (Efficiency: 401,4)
Intel Core i7 920 (HT off) at 3200 MHz
FPU: 1869,244 MIter/s (Efficiency: 146,0) / SSE2: 4573,828 MIter/s (Efficiency: 357,3)

I'll soon start checking whether SSE4 can give some benefit.

20 Feb 2009, 18:03 
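
The efficiency figures in the post above look like speed divided by (physical cores x clock in GHz). That formula is an assumption inferred from the posted numbers, not something stated in the thread. A minimal Python sketch that checks this reading and works out the HT effect from the same results (the names `efficiency` and `ht_change_percent` are made up for illustration):

```python
# Assumption: efficiency = MIter/s / (physical cores * clock in GHz).
# The formula is inferred from the posted numbers, not from the benchmark source.

def efficiency(miter_per_s, cores, ghz):
    return miter_per_s / (cores * ghz)

# Results posted for the Core i7 920 at 3200 MHz (4 physical cores).
results = {
    ("HT on", "FPU"): 1820.197,
    ("HT on", "SSE2"): 5138.498,
    ("HT off", "FPU"): 1869.244,
    ("HT off", "SSE2"): 4573.828,
}

def ht_change_percent(unit):
    """Relative speed change from switching HT on, in percent."""
    on = results[("HT on", unit)]
    off = results[("HT off", unit)]
    return (on / off - 1.0) * 100.0

if __name__ == "__main__":
    for (ht, unit), speed in results.items():
        print(ht, unit, round(efficiency(speed, 4, 3.2), 1))
    print("FPU change with HT:  %.1f%%" % ht_change_percent("FPU"))
    print("SSE2 change with HT: %.1f%%" % ht_change_percent("SSE2"))
```

Under that assumption the four posted efficiency values are reproduced, and HT works out to roughly +12% for SSE2 and about -2.6% for the FPU version.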

Kuemmel
Hi guys,
after having some trouble implementing the 'PTEST' instruction, and luckily finding a solution in the forum, I released an SSE4.1 version of my old code. You'll find it at http://www.mikusite.de/x86/KMB_V0.53I32bMT.zip
It delivers about 2% performance gain on my i7:

Intel Core i7 920 (HT on) at 3200 MHz
FPU: 1820,197 MIter/s (Efficiency: 142,2) / SSE2: 5224,498 MIter/s (Efficiency: 408,2)

Does anyone have an SSE4.1-capable Intel C2D or C2Q to confirm the gain on these CPUs? I guess the next big speed-up will have to wait until we have these SSE5/AVX (whatever it's called) CPUs with ymm's (256-bit registers) and MULADD/MULSUB in 2010 or 2011.

07 Jun 2009, 15:16 

Madis731
On X58 Intel board
BIOS Version/Date: SOX5810J.86A.4014.2009.0507.1930, 7.05.2009
Code:
i7 920 {HT:Enabled; Turbo:Enabled}
FPU: 1596,746
SSE2: 4524,112
SSE4.1: 4573,828

Not much, but at least something

07 Jun 2009, 15:33 

bitRAKE
Code:
Kümmel Mandelbrot Benchmark V 0.53I32bMT_FPU
Speed [Million Iterations / Second] : 2708.579
Kümmel Mandelbrot Benchmark V 0.53I32bMT_SSE2
Speed [Million Iterations / Second] : 6274.648
Kümmel Mandelbrot Benchmark V 0.53I32bMT_SSE4.1
Speed [Million Iterations / Second] : 6091.000
Logical CPU cores detected : 8
CPU Brand detected : Intel(R) Xeon(R) CPU L5410 @ 2.33GHz

No plans for a 64bit version, huh?
(Does someone really have two L5450's? I couldn't find them anywhere but in your results listing.)

07 Jun 2009, 16:05 

Kuemmel
bitRAKE wrote: The SSE4.1 code is a little slower on my machine.

Regarding 64-bit, there's still the one from Xorpd! > http://xorpd.home.comcast.net/~xorpd/
I could also try to implement the 'ptest' there if he's not active any more. But I had a hard time assembling it... he's referring to some old SDK that is needed, which I couldn't find any more... any help, or could anybody send me the setup with the needed files so that I can assemble and link it?
Regarding the 5450's, no idea... I think somebody on http://forums.2cpu.com had them...

07 Jun 2009, 17:42 

Madis731
C2D T9300 @2.5GHz
Code:
FPU: 709.060
SSE2: 1673.800
SSE4.1: 1623.739

Oh no, slower.
I think PTEST is not your savior here, because it's not your average 1-uop instruction: it actually takes 2 uops (fused). PCMPxxx is better because it's a 1-uop instruction and, furthermore, its throughput is one instruction every 0.5 clocks (for PCMPxxx xmm,xmm). If you are worried about the ALU<=>FPU conversions, then CMPxxPD is your better bet, but its throughput is the usual 1 per clock and it also takes 1 uop.

08 Jun 2009, 07:04 

Kuemmel
...yes, I knew that, I was just hoping, as I theoretically saved a CMP in the inner loop... because I had a sequence like
CMPLEPD
MOVMSKPD
CMP
JNE
...replaced by
SUBPD
PTEST
JNC
...but to be able to do that I needed to reorder and rename some registers and instructions, which the C2D seems not to like... okay, at least Hyper-Threading seems able to make some use of it, I guess, as it fills the gaps in the instruction lines...

08 Jun 2009, 17:36 
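
The two exit tests named in the post above can be modelled on two packed doubles. The following Python sketch is only an illustration: the operand order, the 4.0 limit and the sign-bit mask are assumptions, and the real KMB source may arrange the test differently. `escaped_cmp` and `escaped_ptest` are hypothetical names.

```python
import struct

SIGN = 1 << 63  # sign bit of a 64-bit double

def bits(x):
    """64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def movmskpd(lanes):
    """Collect the sign bit of each 64-bit lane (lane 0 goes to bit 0)."""
    return sum(((b >> 63) & 1) << i for i, b in enumerate(lanes))

def escaped_cmp(mag2, limit=4.0):
    # CMPLEPD: a lane becomes all ones where mag2 <= limit, else zero.
    mask = [0xFFFFFFFFFFFFFFFF if m <= limit else 0 for m in mag2]
    # MOVMSKPD, CMP with 3, JNE: branch unless both lanes are still inside.
    return movmskpd(mask) != 3

def escaped_ptest(mag2, limit=4.0):
    # SUBPD: the sign bit of mag2 - limit is set where mag2 < limit.
    dest = [bits(m - limit) for m in mag2]
    src = [SIGN, SIGN]
    # PTEST sets CF when (src AND NOT dest) is zero across all 128 bits.
    cf = all((s & ~d) == 0 for s, d in zip(src, dest))
    # JNC: branch when CF is clear, i.e. some lane is not negative.
    return not cf
```

In this model both tests agree except when a lane is exactly equal to the limit: `CMPLEPD` still counts it as inside, while `SUBPD` yields +0.0, whose sign bit is clear.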

kalambong
hmm... any new development?


13 Oct 2009, 06:50 

tthsqe
I was just looking over the code for "ALGO=1" and it seems that it contains a few inefficiencies. To recall, we are doing
T=X^2-Y^2+x;
Y=2*X*Y+y;
X=T;
where (x,y) is the point and (X,Y) is the sequence of iterations. I would think this can be done more efficiently as:
Code:
suppose xmm1 contains X
        xmm3 contains Y

movaps xmm7,xmm2        ; m7 = X
mulpd  xmm2,xmm2        ; X = X^2
movaps xmm6,xmm1        ; m6 = Y
mulpd  xmm6,xmm1        ; m6 = Y^2
addpd  xmm1,xmm1        ; Y = Y + Y
subpd  xmm2,xmm6        ; X = X - Y^2
addpd  xmm2,qword[x]    ; X = X + x
mulpd  xmm1,xmm7        ; Y = Y*X
addpd  xmm1,qword[y]    ; Y = Y + y

Notice that this contains 2 fewer mov's and 1 fewer mul, while no two instructions of the same type appear consecutively, and it trashes only two other registers, so it could be interlaced with another set of pixels to reduce the data dependencies. If you do make these changes, could you let me know how it affects the performance?

14 Oct 2009, 07:16 

tthsqe
oops, I clearly meant
xmm2 contains X
xmm1 contains Y

14 Oct 2009, 07:18 
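
As a sanity check on the proposed sequence with the corrected register assignment (xmm2 = X, xmm1 = Y), here is a small Python transcription. It is a sketch only: each variable stands for one lane of the named register, and the function names are made up for illustration.

```python
def mandel_step_reference(X, Y, x, y):
    """Textbook form of one iteration: T = X^2 - Y^2 + x; Y = 2*X*Y + y; X = T."""
    return X * X - Y * Y + x, 2.0 * X * Y + y

def mandel_step_sse_order(X, Y, x, y):
    """One iteration in the instruction order of the posted SSE2 sequence."""
    xmm2, xmm1 = X, Y      # corrected assignment: xmm2 = X, xmm1 = Y
    xmm7 = xmm2            # movaps xmm7,xmm2   keep X
    xmm2 = xmm2 * xmm2     # mulpd  xmm2,xmm2   X^2
    xmm6 = xmm1            # movaps xmm6,xmm1   copy Y
    xmm6 = xmm6 * xmm1     # mulpd  xmm6,xmm1   Y^2
    xmm1 = xmm1 + xmm1     # addpd  xmm1,xmm1   2*Y
    xmm2 = xmm2 - xmm6     # subpd  xmm2,xmm6   X^2 - Y^2
    xmm2 = xmm2 + x        # addpd  xmm2,[x]    new X
    xmm1 = xmm1 * xmm7     # mulpd  xmm1,xmm7   2*Y*X
    xmm1 = xmm1 + y        # addpd  xmm1,[y]    new Y
    return xmm2, xmm1
```

The transcription uses two moves, three multiplies and four add/sub operations, and it does not produce X^2 + Y^2 anywhere, so an escape test would need extra work on top of it.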

tthsqe
I'm not quite sure what is happening in the FPU section, but it looks like it can be improved similarly. It looks like you might have them interlaced, but surely the xchg's don't help the FPU?
I haven't found a way to eliminate the xchg when checking for divergence, i.e.
Code:
 fld [four]      ;4
 fld [x]         ;4,x
 fld [y]         ;4,x,y
 fld st1         ;4,x,y,X
 fld st1         ;4,x,y,X Y
again:
 fld st1         ;4,x,y,X Y X
 fmul st0,st2    ;4,x,y,X Y X*X
 fld st1         ;4,x,y,X Y X*X Y
 fmul st0,st2    ;4,x,y,X Y X*X Y*Y
 fld st0         ;4,x,y,X Y X*X Y*Y Y*Y
 fadd st0,st2    ;4,x,y,X Y X*X Y*Y X*X+Y*Y
 fcomip st7      ;4,x,y,X Y X*X Y*Y
 ja done
 fsubp st1,st0   ;4,x,y,X Y X*X-Y*Y
 fadd st0,st3    ;4,x,y,X Y X*X-Y*Y+x
 fxch st0,st1    ;4,x,y,newx Y X
 fmulp st1,st0   ;4,x,y,newX X*Y
 fadd st0,st0    ;4,x,y,newX 2*X*Y
 fadd st0,st2    ;4,x,y,newX newY
 jmp again

but the pure iteration can be done with only two moves (like SSE):
Code:
 fld st1         ;4,x,y,X Y X
 fmul st2,st0    ;4,x,y,X*X Y X
 fld st1         ;4,x,y,X*X Y X Y
 fmul st0,st2    ;4,x,y,X*X Y X Y*Y
 fsub st0,st5    ;4,x,y,X*X Y X Y*Y-x
 fsubp st3,st0   ;4,x,y,newX Y X
 fmulp st1,st0   ;4,x,y,newX X*Y
 fadd st0,st0    ;4,x,y,newX 2*X*Y
 fadd st0,st2    ;4,x,y,newX newY

14 Oct 2009, 09:19 
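
The second (pure iteration) FPU sequence above can be traced with a tiny model of the x87 register stack. This Python sketch is illustrative only: `X87Stack` and `iterate` are made-up names, and only the handful of instructions used in that sequence are modelled.

```python
class X87Stack:
    """Minimal x87 stack model; the list runs bottom to top, so st0 is s[-1]."""

    def __init__(self, values):
        self.s = list(values)
        self.peak = len(self.s)   # deepest stack use seen so far

    def st(self, i):
        return self.s[-1 - i]

    def put(self, i, value):
        self.s[-1 - i] = value

    def fld(self, i):             # fld st(i): push a copy of st(i)
        if len(self.s) == 8:
            raise OverflowError("the x87 stack holds only 8 registers")
        self.s.append(self.st(i))
        self.peak = max(self.peak, len(self.s))

    def fmul(self, dst, src):     # fmul st(dst),st(src)
        self.put(dst, self.st(dst) * self.st(src))

    def fadd(self, dst, src):     # fadd st(dst),st(src)
        self.put(dst, self.st(dst) + self.st(src))

    def fsub(self, dst, src):     # fsub st(dst),st(src)
        self.put(dst, self.st(dst) - self.st(src))

    def fsubp(self, dst):         # fsubp st(dst),st0: subtract, then pop
        self.put(dst, self.st(dst) - self.st(0))
        self.s.pop()

    def fmulp(self, dst):         # fmulp st(dst),st0: multiply, then pop
        self.put(dst, self.st(dst) * self.st(0))
        self.s.pop()


def iterate(stack):
    """Expects the stack as 4,x,y,X,Y (bottom to top)."""
    stack.fld(1)        # fld   st1      4,x,y,X,Y,X
    stack.fmul(2, 0)    # fmul  st2,st0  4,x,y,X*X,Y,X
    stack.fld(1)        # fld   st1      4,x,y,X*X,Y,X,Y
    stack.fmul(0, 2)    # fmul  st0,st2  4,x,y,X*X,Y,X,Y*Y
    stack.fsub(0, 5)    # fsub  st0,st5  4,x,y,X*X,Y,X,Y*Y-x
    stack.fsubp(3)      # fsubp st3,st0  4,x,y,newX,Y,X
    stack.fmulp(1)      # fmulp st1,st0  4,x,y,newX,X*Y
    stack.fadd(0, 0)    # fadd  st0,st0  4,x,y,newX,2*X*Y
    stack.fadd(0, 2)    # fadd  st0,st2  4,x,y,newX,newY
```

Starting from five live values, the trace peaks at seven of the eight x87 registers for a single point.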

tthsqe
Oh, never mind: the first code can be written like the second to eliminate the exchange. It can be done.


14 Oct 2009, 09:25 

Kuemmel
tthsqe wrote: I was just looking over the code for "ALGO=1" and it seems that it contains a few inefficiencies. To recall, we are doing

Hi tthsqe,
I had a short look at your SSE2 code and tried to fit it to my first sequence in the main loop, see below. I think the major part missing in your code is the calculation of X^2 + Y^2 and saving it (the boundary check). I have to do it to be really 'correct' about when the iteration should end in the loop unrolling. Until now I haven't found a convenient way to implement it in your code sequence... as it 'needs' xmm1 and xmm0 'later' than mine... any suggestions?
Regarding the FPU, you have to pay attention to the stack size. As I'm always iterating 3 points, there's limited space. Look at '.continue_main_loop_entry_1:' to see the stack overview in the main loop. I think yours is using too much of the stack to keep the rest of the points saved.
Code:
.continue_main_loop_entry_12:

kuemmel:
movaps xmm3, [.two_two]          ; 2.0  2.0          ; 1st iter point 1 and 2
mulpd  xmm3, xmm1                ; 2*iz  2*iz
mulpd  xmm1, xmm1                ; iz^2  iz^2
mulpd  xmm3, xmm0                ; 2*rz*iz  2*rz*iz
mulpd  xmm0, xmm0                ; rz^2  rz^2
movaps xmm2, xmm0                ; rz^2  rz^2
subpd  xmm0, xmm1                ; rz^2-iz^2  rz^2-iz^2
addpd  xmm1, xmm2                ; rz^2+iz^2  rz^2+iz^2
addpd  xmm0, [.local_rz0_12]
movaps [.state_of_12_a],xmm1
movaps xmm1, [.local_iz0]
addpd  xmm1, xmm3

tthsqe:
movaps xmm3,xmm0                 ; rz rz
mulpd  xmm0,xmm0                 ; rz^2 rz^2
movaps xmm2,xmm1                 ; iz iz
mulpd  xmm2,xmm1                 ; iz^2 iz^2
addpd  xmm1,xmm1                 ; 2*iz 2*iz
subpd  xmm0,xmm2                 ; rz^2 - iz^2   rz^2 - iz^2
addpd  xmm0,[.local_rz0_12]      ; new rz + rd
mulpd  xmm1,xmm3                 ; 2*iz*rz 2*iz*rz
addpd  xmm1,[.local_iz0]         ; new iz + id

14 Oct 2009, 17:52 

tthsqe
Oh, I failed to realize that you were calculating rz^2 + iz^2 at every step. If you want to calculate EXACTLY when the iteration gets out of the circle, then it looks like you need 12 instructions per iteration. Are you really against calculating rz^2 + iz^2 only every, let's say, 4th iteration?
I'll see if the initial multiply by 2.0 can be replaced with an addition.

14 Oct 2009, 20:31 
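
To illustrate what testing the radius only every 4th iteration would change, here is a small Python sketch. It is not the benchmark's loop: `escape_count` is a hypothetical helper, the start value z0 = c follows the FPU listing earlier in the thread, and `max_iter` is an arbitrary choice.

```python
def escape_count(x, y, check_every=1, max_iter=256, limit=4.0):
    """Iterate z -> z^2 + c from z0 = c and return the step at which
    X^2 + Y^2 > limit was noticed, or max_iter if it never was."""
    X, Y = x, y
    for n in range(1, max_iter + 1):
        X, Y = X * X - Y * Y + x, 2.0 * X * Y + y
        # The radius test only runs on every check_every-th step.
        if n % check_every == 0 and X * X + Y * Y > limit:
            return n
    return max_iter

if __name__ == "__main__":
    print(escape_count(1.0, 0.0))                 # tested at every step
    print(escape_count(1.0, 0.0, check_every=4))  # tested at every 4th step
```

With the coarser test the reported count is rounded up to the next multiple of 4, so the per-pixel iteration counts differ from the exact ones by up to 3.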

tthsqe
Ok, it can be done in 11 instructions, and the pure iteration can be done with one less mov and one less register than I originally had.
Code:
with X^2+Y^2:
        0       1       2       3
 0:     X       Y       ?       ?
 1:     X       Y       Y       ?       mov 2,1
 2:     X       XY      Y       ?       mul 1,0
 3:     X       2XY     Y       ?       add 1,1
 4:     X       2XY     YY      ?       mul 2,2
 5:     X       newY    YY      ?       add 1,[y]
 6:     XX      newY    YY      ?       mul 0,0
 7:     XX      newY    YY      XX      mov 3,0
 8:     XX-YY   newY    YY      XX      sub 0,2
 9:     newX    newY    YY      XX      add 0,[x]
10:     newX    newY    YY      XX+YY   add 3,2
11:     newX    newY    YY      XX+YY   mov [state],3

without:
        0       1       2
 0:     X       Y       ?
 1:     X       Y       Y       mov 2,1
 2:     X       XY      Y       mul 1,0
 3:     X       2XY     Y       add 1,1
 4:     X       2XY     YY      mul 2,2
 5:     X       newY    YY      add 1,[y]
 6:     XX      newY    YY      mul 0,0
 7:     XX-YY   newY    YY      sub 0,2
 8:     newX    newY    YY      add 0,[x]

I doubt if anything smaller is possible.

14 Oct 2009, 21:11 
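
The two register tables above can be replayed mechanically. The following Python sketch is a toy interpreter written for this purpose: the tuple encoding, `run`, `WITH_CHECK` and `WITHOUT_CHECK` are made up for illustration, with string operands standing for the memory slots [x], [y] and [state].

```python
def run(program, regs, mem):
    """Execute (op, dst, src) steps; integers are register numbers,
    strings are memory slots such as 'x', 'y' or 'state'."""
    for op, dst, src in program:
        val = mem[src] if isinstance(src, str) else regs[src]
        if op == "mov":
            if isinstance(dst, str):
                mem[dst] = val          # store to memory
            else:
                regs[dst] = val         # register copy
        elif op == "mul":
            regs[dst] = regs[dst] * val
        elif op == "add":
            regs[dst] = regs[dst] + val
        elif op == "sub":
            regs[dst] = regs[dst] - val
        else:
            raise ValueError("unknown op: " + op)
    return regs, mem

# Steps 1..11 of the table "with X^2+Y^2".
WITH_CHECK = [
    ("mov", 2, 1), ("mul", 1, 0), ("add", 1, 1), ("mul", 2, 2),
    ("add", 1, "y"), ("mul", 0, 0), ("mov", 3, 0), ("sub", 0, 2),
    ("add", 0, "x"), ("add", 3, 2), ("mov", "state", 3),
]

# Steps 1..8 of the table "without".
WITHOUT_CHECK = [
    ("mov", 2, 1), ("mul", 1, 0), ("add", 1, 1), ("mul", 2, 2),
    ("add", 1, "y"), ("mul", 0, 0), ("sub", 0, 2), ("add", 0, "x"),
]

if __name__ == "__main__":
    regs, mem = run(WITH_CHECK, [0.25, 0.5, None, None], {"x": 0.25, "y": 0.5})
    print(regs[0], regs[1], mem["state"])
```

Replaying the tables this way confirms that register 0 ends as X^2 - Y^2 + x, register 1 as 2XY + y, and, in the longer version, that the stored state is X^2 + Y^2 of the values before the update.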

Kuemmel
Hi tthsqe,
okay, maybe for the Mandelbrot pictures it is not so important to check for the end of the iteration at each step, but mathematically, and to keep the 'rules' for optimizing, I kept it like that so I can really compare against my old code. I have implemented your idea now and it seems to be about 10% slower on my i7 @ 3200 MHz. I think Xorpd! (who made the 64-bit version) could maybe explain it a little better, but it has a lot to do with dependencies and register usage. With your code the auxiliary xmm2 and xmm3 are used much more, so the benefit of different instruction lines (like the 3 pixel iteration blocks) goes down. So minimizing the instruction count isn't always that useful, especially not on the Core 2 Duo architecture, and replacing MULs with ADDs might not end in a better result either. For me it's often trial and error; some guys may know it right away... Find the file attached with your changes. Maybe another idea?


15 Oct 2009, 19:46 

Kuemmel
kalambong wrote: hmm... any new development?

Hi Kalambong,
unfortunately not, I ran out of ideas on how to optimize further. The only thing left would be to implement the backtrace method of Xorpd! from the 64-bit version here, but I'm too lazy and it doesn't gain that much. I'm rather looking forward to the next generation architecture of Intel and AMD, especially SSE5 or AVX or whatever it's called. The bus of the SSE unit is also wider there, so I expect a benefit for my existing SSE2 code as well... but it seems like waiting until 2010/2011 for that stuff to be out...

15 Oct 2009, 19:57 

Copyright © 1999-2020, Tomasz Grysztar.