flat assembler
Message board for the users of flat assembler.

Index > Windows > Mandelbrot Benchmark FPU/SSE2 released

Goto page Previous  1, 2, 3 ... 6, 7, 8 ... 18, 19, 20  Next
Author
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Xorpd! wrote:
I agree that the Xeon results look suspect -- I would guess that he doesn't have one of the cores enabled on the x64 tests. It seems this is a common problem. If you could get DaveB to fix that somehow and run all 10 or 11 benchmarks again, his system should end up seriously smoking everything we have seen to date...

Hm, he claims that everything is okay. The fact that KMB V0.53 scaled normally on his machine points that way, and according to him Cinebench also ran normally...hard to say what's wrong...maybe some instruction cache limit issue!? Or is the inner loop of your versions still too short for that to be the problem?
Post 05 Nov 2007, 19:20
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Yeah, but I think the problem may be in how his x64 OS is set up. Clearly he has all 8 cores enabled when he boots to 32 bits, as can be seen from his results with MB_V0.53_MT_FPU.exe and MB_V0.53_MT_SSE2.exe.

This gives us a good opportunity, if DaveB is willing, to test my hypothesis: have him run those two 32-bit benchmarks while booted to x64. If my hypothesis is correct, he should get performances of about 494 and 1060 million iterations per second, whereas if all 8 cores are enabled, he should get about 988 and 2120 million iterations per second. This is what benchmarks are supposed to be good for, after all: to determine whether there are problems with your system by measuring performance and comparing with expected performance.
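The arithmetic behind this check is simple linear scaling; a minimal Python sketch of the assumption (illustration only, using the figures quoted above):

```python
# If per-core throughput is fixed, total throughput scales roughly
# linearly with the number of enabled cores -- so a run at half the
# expected rate suggests half the cores are active.
def expected_rate(per_core_rate, cores):
    """Ideal linear scaling: total million iterations/sec."""
    return per_core_rate * cores

# ~494 Miter/s observed on 4 cores implies ~988 Miter/s with all 8,
# and ~1060 implies ~2120, matching the numbers in the post.
assert expected_rate(494.0 / 4, 8) == 988.0
assert expected_rate(1060.0 / 4, 8) == 2120.0
```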

I checked my 64-bit version with dumpbin:
Code:
  000000000040164E: BA 01 00 00 00     mov         edx,1
  0000000000401653: 48 8B 0D 7E 60 00  mov         rcx,qword ptr [004076D8h]
                    00
  000000000040165A: E8 C5 39 00 00     call        0000000000405024
  000000000040165F: BA 02 00 00 00     mov         edx,2
  0000000000401664: 48 8B 0D 75 60 00  mov         rcx,qword ptr [004076E0h]
                    00
  000000000040166B: E8 B4 39 00 00     call        0000000000405024
  0000000000401670: BA 04 00 00 00     mov         edx,4
  0000000000401675: 48 8B 0D 6C 60 00  mov         rcx,qword ptr [004076E8h]
                    00
  000000000040167C: E8 A3 39 00 00     call        0000000000405024
  0000000000401681: BA 08 00 00 00     mov         edx,8
  0000000000401686: 48 8B 0D 63 60 00  mov         rcx,qword ptr [004076F0h]
                    00
  000000000040168D: E8 92 39 00 00     call        0000000000405024
  0000000000401692: BA 10 00 00 00     mov         edx,10h
  0000000000401697: 48 8B 0D 5A 60 00  mov         rcx,qword ptr [004076F8h]
                    00
  000000000040169E: E8 81 39 00 00     call        0000000000405024
  00000000004016A3: BA 20 00 00 00     mov         edx,20h
  00000000004016A8: 48 8B 0D 51 60 00  mov         rcx,qword ptr [00407700h]
                    00
  00000000004016AF: E8 70 39 00 00     call        0000000000405024
  00000000004016B4: BA 40 00 00 00     mov         edx,40h
  00000000004016B9: 48 8B 0D 48 60 00  mov         rcx,qword ptr [00407708h]
                    00
  00000000004016C0: E8 5F 39 00 00     call        0000000000405024
  00000000004016C5: BA 80 00 00 00     mov         edx,80h
  00000000004016CA: 48 8B 0D 3F 60 00  mov         rcx,qword ptr [00407710h]
                    00
  00000000004016D1: E8 4E 39 00 00     call        0000000000405024
  00000000004016D6: BA 00 01 00 00     mov         edx,100h
  00000000004016DB: 48 8B 0D 36 60 00  mov         rcx,qword ptr [00407718h]
                    00
  00000000004016E2: E8 3D 39 00 00     call        0000000000405024
  00000000004016E7: BA 00 02 00 00     mov         edx,200h
  00000000004016EC: 48 8B 0D 2D 60 00  mov         rcx,qword ptr [00407720h]
                    00
  00000000004016F3: E8 2C 39 00 00     call        0000000000405024
  00000000004016F8: BA 00 04 00 00     mov         edx,400h
  00000000004016FD: 48 8B 0D 24 60 00  mov         rcx,qword ptr [00407728h]
                    00
  0000000000401704: E8 1B 39 00 00     call        0000000000405024
  0000000000401709: BA 00 08 00 00     mov         edx,800h
  000000000040170E: 48 8B 0D 1B 60 00  mov         rcx,qword ptr [00407730h]
                    00
  0000000000401715: E8 0A 39 00 00     call        0000000000405024
  000000000040171A: BA 00 10 00 00     mov         edx,1000h
  000000000040171F: 48 8B 0D 12 60 00  mov         rcx,qword ptr [00407738h]
                    00
  0000000000401726: E8 F9 38 00 00     call        0000000000405024
  000000000040172B: BA 00 20 00 00     mov         edx,2000h
  0000000000401730: 48 8B 0D 09 60 00  mov         rcx,qword ptr [00407740h]
                    00
  0000000000401737: E8 E8 38 00 00     call        0000000000405024
  000000000040173C: BA 00 40 00 00     mov         edx,4000h
  0000000000401741: 48 8B 0D 00 60 00  mov         rcx,qword ptr [00407748h]
                    00
  0000000000401748: E8 D7 38 00 00     call        0000000000405024
  000000000040174D: BA 00 80 00 00     mov         edx,8000h
  0000000000401752: 48 8B 0D F7 5F 00  mov         rcx,qword ptr [00407750h]
                    00
  0000000000401759: E8 C6 38 00 00     call        0000000000405024
    

So I didn't do anything funny like set MAXTHREADS to 8 when I assembled the executable that's on my website. I replaced the original multithreading code (due to f0dder?) with macros controlled by a single constant, so something like that was a real possibility. Another way to check if there are 16 threads running is simply to look at the screen during rendering: with more threads than cores, lines appear and disappear across the images as threads are interrupted and resume. With at least as many cores as threads the images are rendered smoothly without this effect.
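The pattern in the dumpbin listing above can also be checked mechanically: each call site loads a distinct one-bit constant (1, 2, 4, ..., 8000h) into edx before the same call target, i.e. one mask per thread. A small Python sketch (reading edx as a per-thread mask is my interpretation of the listing, not something the dump itself states):

```python
# The disassembly shows 16 call sites, each passing a one-bit constant:
# 1, 2, 4, ..., 8000h.  Generate the same sequence and check it.
masks = [1 << i for i in range(16)]

assert masks[0] == 0x1
assert masks[-1] == 0x8000
assert len(set(masks)) == 16                        # 16 distinct masks -> 16 threads
assert all(bin(m).count("1") == 1 for m in masks)   # exactly one bit each
```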

Also I don't know what the problem is with Rick5127. I get a similar problem when my monitor is set in portrait mode, but I do get the popup at the end which tells me the performance. This is the case for both the 32-bit and the 64-bit versions. Another possibility is that his system is fast enough that the benchmark is over before it has switched video modes successfully. One way to check for this is to run MB_V0.53_MT_FPU.exe because it may be slow enough that it could give his monitor/video card time to make the switch even if the others aren't. Also we could see if it's just a problem with my code because the 32-bit version is more thoroughly tested. That is also a possibility because he should have seen the popup if the program wasn't simply scrogged by the OS without notification.

Well, thanks again to all concerned and keep in touch.
Post 05 Nov 2007, 22:36
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
First AMD Phenom Results !!!

It looks like AMD can regain some speed regarding SSE2; still, Intel takes the lead at the same clock speed, and Intel's chips can simply clock higher...

I updated the table on my webpage and also made one for Xorpd!'s version, to give an overview of the speed per MHz and per core on the same page below. The E5310 result still isn't confirmed, as there were problems with enabled cores, but it fits the other Core 2 Duo results.

Here are the pure AMD Phenom Results I got:

KMB V0.53 MT:
AMD Phenom 9500 2500 MHz, 4 Cores:
FPU: 635.082
SSE2: 1260.500

Xorpd!s x64 version:
AMD Phenom 9500 2500 MHz, 4 Cores:
KMB0.57 2T_X1 - 623.935
KMB0.57 2T_X2 - 1142.545
KMB0.57 2T_X3 - 627.709
KMB0.57 2T_X4 - 1541.304
KMB0.57 MT_X1 - 1177.102
KMB0.57 MT_X2 - 2072.169
KMB0.57 MT_X3 - 1181.569
KMB0.57 MT_X4 - 2725.107

AMD Phenom 9500 1666 MHz, 4 Cores:
KMB0.57 2T_X1 - 416.235
KMB0.57 2T_X2 - 761.231
KMB0.57 2T_X3 - 418.472
KMB0.57 2T_X4 - 1029.234
KMB0.57 MT_X1 - 789.210
KMB0.57 MT_X2 - 1393.035
KMB0.57 MT_X3 - 799.341
KMB0.57 MT_X4 - 1834.129

Everything is in line, but what's really strange are the results for the X3 version. Any clue, Xorpd!?
Post 08 Dec 2007, 14:16
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
AMD Athlon 64 X2 5400+ @ 2.8 GHz, some background apps running

FPU - 344.698
SSE2 - 472.987

Binary used from first post downloaded today.

Formatted according to the included XLS file:
AMD Athlon 64 X2 5400+ 2800 2 344.698 61.6 472.987 84.5 1.37
Post 08 Dec 2007, 21:56
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
From what I've seen so far, AMD is heading for disaster with their latest chips :/
Post 08 Dec 2007, 23:29
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
I should probably remind you that that was in 32-bit mode, on a 64-bit chip (I don't have a 64-bit version of Windows yet).

And Intel may be getting better, but they still cost twice as much for about the same performance level.
Post 09 Dec 2007, 11:18
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Oh really?

Phenom Quad 9600 (2.3GHz): DKK 1.890
Core 2 Quad Q6600 (2.4GHz): DKK 1.728

...and from the results I've seen, the recent AMD chips are a bit slower per MHz, and AMD seems to have trouble reaching high enough frequencies to even compete with Intel's top line...

If they had lower power consumption then that would be something, but it does look pretty bleak for AMD right now. Which is sad, really; I prefer some competition in the CPU market.
Post 09 Dec 2007, 11:41
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
The thread seems to be trashed, and I'll add some more :) Until AMD fixes the TLB problem in hardware (not just by turning the logic off in their firmware), they will remain only 2nd in the market.
Post 09 Dec 2007, 11:55
Xorpd!



Joined: 21 Dec 2006
Posts: 161
The phenom results are very interesting. There is one obvious difference with the X3 code that I can recall. In this archive I have a test to see whether the problem is with the extra memory moves in that version: old_X3.exe is the original and does 1645.144 million iterations per second on my PC, whereas test_X3 is modified to use memory moves and only does 1602.798 million iterations per second. The latter version may run at normal rates on the Phenom, however, and I would very much like to see the results of these two tests on that processor if possible.
Post 11 Dec 2007, 21:22
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
I cannot wait for Agner to update the Penryn/Phenom part of his http://agner.org/optimize Instruction tables...

Does anyone know where he gets his CPUs from: donations or some other way?
Post 12 Dec 2007, 08:14
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
f0dder wrote:
Oh really?

Phenom Quad 9600 (2.3GHz): DKK 1.890
Core 2 Quad Q6600 (2.4GHz): DKK 1.728

...and from the results I've seen, the recent AMD chips are a bit slower per MHz, and AMD seems to have trouble reaching high enough frequencies to even compete with Intel's top line...

If they had lower power consumption then that would be something, but it does look pretty bleak for AMD right now. Which is sad, really; I prefer some competition in the CPU market.


(further reading here):

Post 12 Dec 2007, 13:06
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
C7 might have low power consumption, but it also has pretty low performance :/ - and from sporadic googling, it sounds like there are all sorts of chipset bugs/deficiencies as well. Too bad; the Padlock (AES/Rijndael hardware) sounds cute.
Post 12 Dec 2007, 13:15
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
@Xorpd!: I passed the link to the guy with the Phenom machine, hoping to get some results soon!

@Madis731: I think if you are a 'relevant' member of the overclocking scene or of some online magazines, then you'll get engineering samples after a while; also, somehow there are lots of engineering samples on eBay in Asia, as I read in some overclocker forums... Intel or AMD might even have an interest, if the CPUs are good, in spreading them around somehow to create some 'hype' on the web...
Post 12 Dec 2007, 19:18
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Xorpd! wrote:
The phenom results are very interesting. There is one obvious difference with the X3 code that I can recall. In this archive I have a test to see whether the problem is with the extra memory moves in that version: old_X3.exe is the original and does 1645.144 million iterations per second on my PC, whereas test_X3 is modified to use memory moves and only does 1602.798 million iterations per second. The latter version may run at normal rates on the Phenom, however, and I would very much like to see the results of these two tests on that processor if possible.

Hi Xorpd!,

I got the results:

Phenom 2497MHz
KMB0.57 MT_X3 - 1172.668
KMB0.58 MT_X3 - 2418.201

So it seems you solved it with the new version! Can you go into a little more detail on what the limit of the old one was, and why a Core 2 Duo apparently doesn't care while a Phenom does!?
Post 15 Dec 2007, 22:18
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
@Kuemmel: I think 'relevant' was in order Smile because I haven't actually overclocked a CPU in my life (or anything else, for that matter)...
...BUT (there's a big BUT) I am a very eager person to get the newest (especially fastest) stuff on the market.

That's for a few things:
- Why do you think optimisations are so important?
- Why do you think new technologies are so important (SSEx)
- Smile I am watching this thread, aren't I, so when I get a Penryn I will tell you the results Very Happy
Post 15 Dec 2007, 23:54
Xorpd!



Joined: 21 Dec 2006
Posts: 161
@Kümmel: For reference, let me post the inner loop:
Code:
.iteration_entry1:
macro single_step
{
if TESTME = 1
   movaps xmm3, xmm1           ; xmm3:    iz           |   iz+dz
   mulpd  xmm1, xmm8           ; xmm1:    2*iz         |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm1, xmm0           ; xmm1:    2*iz*rz      |   2*(iz+dz)*(rz+dz)
   mulpd  xmm0, xmm0           ; xmm0:    rz^2         |   (rz + dz)^2

   movaps [r11], xmm0           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm0, xmm3           ; xmm0:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm1, xmm5           ; xmm1:    2*iz*rz+iz0  |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm0, xmm4           ; xmm0:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm6, xmm3           ; Add iteration counts

   movaps xmm3, xmm10          ; xmm3:    iz           |   iz+dz
   mulpd  xmm10, xmm8          ; xmm10:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm10, xmm9          ; xmm10:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm9, xmm9           ; xmm9:    rz^2         |   (rz + dz)^2

   movaps [r11], xmm9           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm9, xmm3           ; xmm9:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm10, xmm5          ; xmm10:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm9, xmm13          ; xmm9:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm15, xmm3          ; Add iteration counts

   movaps xmm3, xmm12          ; xmm3:    iz           |   iz+dz
   mulpd  xmm12, xmm8          ; xmm12:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm12, xmm11         ; xmm12:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm11, xmm11         ; xmm11:   rz^2         |   (rz + dz)^2

   movaps [r11], xmm11          ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm11, xmm3          ; xmm11:   rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm12, xmm5          ; xmm12:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm11, [rsp+288]     ; xmm11:   rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm14, xmm3          ; Add iteration counts
else if TESTME = 2
   movaps xmm3, xmm1           ; xmm3:    iz           |   iz+dz
   mulpd  xmm1, xmm8           ; xmm1:    2*iz         |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm1, xmm0           ; xmm1:    2*iz*rz      |   2*(iz+dz)*(rz+dz)
   mulpd  xmm0, xmm0           ; xmm0:    rz^2         |   (rz + dz)^2

   movaps xmm2, xmm0           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm0, xmm3           ; xmm0:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm1, xmm5           ; xmm1:    2*iz*rz+iz0  |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, xmm2           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm0, xmm4           ; xmm0:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm6, xmm3           ; Add iteration counts

   movaps xmm3, xmm10          ; xmm3:    iz           |   iz+dz
   mulpd  xmm10, xmm8          ; xmm10:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm10, xmm9          ; xmm10:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm9, xmm9           ; xmm9:    rz^2         |   (rz + dz)^2

   movaps xmm2, xmm9           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm9, xmm3           ; xmm9:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm10, xmm5          ; xmm10:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, xmm2           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm9, xmm13          ; xmm9:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm15, xmm3          ; Add iteration counts

   movaps xmm3, xmm12          ; xmm3:    iz           |   iz+dz
   mulpd  xmm12, xmm8          ; xmm12:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm12, xmm11         ; xmm12:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm11, xmm11         ; xmm11:   rz^2         |   (rz + dz)^2

   movaps xmm2, xmm11          ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm11, xmm3          ; xmm11:   rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm12, xmm5          ; xmm12:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, xmm2           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm11, [rsp+288]     ; xmm11:   rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm14, xmm3          ; Add iteration counts
else
   movaps xmm3, xmm1           ; xmm3:    iz           |   iz+dz
   mulpd  xmm1, xmm8           ; xmm1:    2*iz         |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm1, xmm0           ; xmm1:    2*iz*rz      |   2*(iz+dz)*(rz+dz)
   mulpd  xmm0, xmm0           ; xmm0:    rz^2         |   (rz + dz)^2

   movaps [r11], xmm0           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm0, xmm3           ; xmm0:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm1, xmm5           ; xmm1:    2*iz*rz+iz0  |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm0, xmm4           ; xmm0:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm6, xmm3           ; Add iteration counts

   movaps xmm3, xmm10          ; xmm3:    iz           |   iz+dz
   mulpd  xmm10, xmm8          ; xmm10:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm10, xmm9          ; xmm10:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm9, xmm9           ; xmm9:    rz^2         |   (rz + dz)^2

   movaps [r11+16], xmm9       ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm9, xmm3           ; xmm9:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm10, xmm5          ; xmm10:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11+16]       ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm9, xmm13          ; xmm9:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm15, xmm3          ; Add iteration counts

   movaps xmm3, xmm12          ; xmm3:    iz           |   iz+dz
   mulpd  xmm12, xmm8          ; xmm12:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm12, xmm11         ; xmm12:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm11, xmm11         ; xmm11:   rz^2         |   (rz + dz)^2

   movaps [r11+32], xmm11      ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm11, xmm3          ; xmm11:   rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm12, xmm5          ; xmm12:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11+32]       ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm11, [rsp+288]     ; xmm11:   rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm14, xmm3          ; Add iteration counts
end if
}
repeat 3
   single_step
end repeat
    

The second possibility, TESTME = 2, is the most obvious choice. (Re z)² has been calculated and we need both (Re z)²-(Im z)² to get the next Re z and also (Re z)²+(Im z)² so that we may test |z|² to see whether we've diverged yet. The first thing you would think of is saving it, that is (Re z)², in another register (xmm2 in the actual code) and calculating the next Re z, destroying the current register. Later, we restore the value of (Re z)² from the register we saved it in and use it to compute |z|².
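For readers less used to the SIMD form: the scalar recurrence those register comments describe is the usual Mandelbrot step, sketched here in Python (illustration only, not the benchmark code):

```python
def mandel_iters(rc, ic, max_iter=256):
    """Iterate z -> z^2 + c, counting steps until |z|^2 > 4 (divergence)."""
    rz, iz = 0.0, 0.0
    for n in range(max_iter):
        rz2, iz2 = rz * rz, iz * iz        # (Re z)^2 and (Im z)^2
        if rz2 + iz2 > 4.0:                # the cmplepd test in the loop above
            return n
        # rz2 feeds both the next Re z and the divergence test --
        # exactly the value the asm saves in xmm2 or at [r11].
        rz, iz = rz2 - iz2 + rc, 2.0 * rz * iz + ic
    return max_iter

assert mandel_iters(0.0, 0.0) == 256   # the origin never diverges
assert mandel_iters(2.0, 2.0) == 1     # far outside, escapes immediately
```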

The problem with this approach on Core 2 Duo is that movaps (and its buddies movapd and movdqa) when used to move values between two registers can execute in ports 0, 1, or 5. If it issues to port 0, it can take up a slot that could be used by a floating point multiply. If it goes to port 1, it could get in the way of a floating point add. The slotting logic is not smart enough to see this coming and sometimes it does seem to interfere with computational ports.

The solution to this problem on Core 2 Duo is to store (Re z)² in memory (at [r11] in the actual code) instead of a register (TESTME = 1). Of course this increases the latency of the save+restore sequence significantly from 3 to IIRC 5 clocks. This isn't a problem because the restored value of (Re z)² will be used in the out of band test for divergence rather than the sequential operation of computing the next value of z. As a consequence, the save and restore operations never get in the way of computational progress and the algorithm runs > 2% faster on Core 2 Duo.

This minor optimization wasn't undertaken for the other (i.e. X1, X2, and X4) tests so there wasn't any problem with them on phenom. But it seems that phenom can't use a memory location the way it can a register and the load from the previous loop iteration must complete or at least be well under way before the next store can be issued to it. Considering the results we have gathered I would say that Core 2 Duo, Pentium D Presler, and even Athlon 64 class processors can handle operations to a memory location out of order, but the phenom can only do so when it's a register, not memory.

As usual, an optimizer is never happy to leave things the way they are, and even though going back to a register temporary is twice as fast on phenom, it's a little slower on Core 2 Duo. The solution may be to store to a different memory location for each instruction stream (TESTME = 3). This is just as fast on Core 2 Duo as reusing the same memory location.
Code:
temporary  TESTME  executable    M iter./s
[r11]      1       testa_X3.exe  1645.144
xmm2       2       testb_X3.exe  1611.091
[r11+x]    3       testc_X3.exe  1645.144
    

This ZIP archive contains the two old and the one new version of the code. My hope is that phenom will be happy with testc_X3.exe because it uses different memory locations for different instruction streams. If your phenom owner is as curious as I about performance issues regarding this new processor, or at least has patience with us, you may be able to prevail on him to perform the tests with the new programs.

@Madis731: It's not necessary to have a Penryn to test KMB_V0.57_MT: all you need is x64 Windows. I am always looking for results from processors that aren't in my table.
Post 16 Dec 2007, 03:52
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Okay, so here are my results:
(EDIT: the final update, with all the tests)

I conducted 2 tests on the same CPU type. The first was an HP nc6320
laptop with 1GB RAM and an integrated video card. The other was an Intel
BTO laptop with 2GB of (slow) RAM and a GeForce 7600 video card. The Intel
tests were run over a terminal connection. It seems that only the CPU
matters, so use whichever test you want for reference:
Code:
;The HP lap with a T7200 (Server 2003 Enterprise x64 Ed.)
;-----------------------------------------------------------------
2T:
 609,581 ;This is @2GHz x 2 then, okay Smile
 900,487
1260,500
1470,335

MT:
 603,964
 883,224
1231,824
1429,821

Kümmel:
FPU: 250.482
SSE: 547.899

QuickMAN (as instructed):
MIters/s: 532.4
    


Code:
;The Intel lap with a T7200 (Server 2003 Standard x64 Ed.)
;-----------------------------------------------------------------
2T:
 609,581
 903,099
1255,417
1470,335

MT:
 601,630
 885,757
1231,824
1429,821

Kümmel:
FPU: 251.849
SSE: 550.807

QuickMAN (as instructed):
MIters/s: 532.4
    

There's also a quadcore PC in the pack and it looks like this:
Code:
;With a Q6600 (Server 2003 Standard x64 Ed.)
2T:
 710,424 ;This is @2.4GHz x 2 ...
1049,178
1463,424
1715,391

MT:
1374,585 ; and this is @2.4GHz x 4
1945,896
2798,593
3129,080

Kümmel:
FPU: 591.637
SSE:1281.249

QuickMAN (as instructed):
MIters/s: 643.1
    


Last edited by Madis731 on 16 Dec 2007, 12:18; edited 4 times in total
Post 16 Dec 2007, 09:51
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Madis731 wrote:

- Why do you think optimisations are so important?
- Why do you think new technologies are so important (SSEx)
- Smile I am watching this thread, aren't I, so when I get a Penryn I will tell you the results Very Happy

Hi Madis731,

...I think optimisations like the ones seen here are very important because they could let people buy the slowest Core 2 Duo instead of the Extreme Edition, which would be an 800 Euro difference, if only the coders would optimise a little more Wink

...SSE is very important: when you look at the difference achieved here in contrast to the FPU version KMB 0.53, you'll see that using the FPU becomes totally obsolete if you don't need extended precision. So even if you have a calculation that can't be vectorized, you are probably still faster using SSE.

And with SSE4 (Penryn) and SSE5 (AMD, somewhere in 2009) things get even better. Some examples that I can directly think of being beneficial for the algorithm here:
SSE4: PTEST (like 'TEST' for SSE instructions)
SSE5: FMADD (fused multiply-add, finally!)
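To make the FMADD remark concrete: each multiply in the step z -> z^2 + c is followed by an add that a fused multiply-add could absorb. A Python sketch of the pairing only; the `fma` helper here is a plain stand-in and does not model the single rounding real FMADD hardware provides:

```python
# Which multiply/add pairs in one Mandelbrot step a fused multiply-add
# could combine.  fma(a, b, c) computes a*b + c in two rounded steps,
# standing in for the one-rounding hardware op.
def fma(a, b, c):
    return a * b + c

def step(rz, iz, rc, ic):
    rz_next = fma(rz, rz, fma(-iz, iz, rc))  # rz^2 - iz^2 + rc: two FMAs
    iz_next = fma(2.0 * rz, iz, ic)          # 2*rz*iz + ic: one mul + FMA
    return rz_next, iz_next

assert step(0.0, 0.0, -0.75, 0.1) == (-0.75, 0.1)
assert step(1.0, 1.0, 0.0, 0.0) == (0.0, 2.0)
```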

@Xorpd!: I passed the link to the Phenom guy I'm in contact with; let's see what he gets. Thanks for the explanations...really interesting... As I'm more used to coding for ARM CPUs I wouldn't even have thought of using memory to store intermediates, as memory access is so slow...but then ARM already has 16 registers to use Wink

@EDIT: Got the results:
KMB0.58a MT_X3 - 1177.102
KMB0.58b MT_X3 - 2363.138
KMB0.58c MT_X3 - 2310.526
Any more conclusions from that, Xorpd!?

By the way, to get people testing the latest stuff I always try to convince some guys from http://www.xtremesystems.org/forums/ ...crazy people with liquid nitrogen pushing Core 2 Duo beyond 6 GHz. Sometimes quite interesting to read...just total hardware 'optimisations' instead of software like here.
Post 16 Dec 2007, 11:02
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Quote:
...I think optimisations like seen here are very important because it could make people buy the slowest Core 2 Duo instead of the Extreme Edition, what would be 800 Euro difference, if just the coders would optimise a little more


What about smoother running for other applications in a multi-tasking environment? Also, what about more enemies in a game? Take away all the bottlenecks and you've got software that can raise its own requirements. If I could ever sort out this rotation issue in OpenGL, I could probably make a small game that uses this to its advantage.
Post 16 Dec 2007, 11:59
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Kuemmel wrote:

...I think optimisations like the ones seen here are very important because they could let people buy the slowest Core 2 Duo instead of the Extreme Edition, which would be an 800 Euro difference, if only the coders would optimise a little more Wink

...SSE is very important: when you look at the difference achieved here in contrast to the FPU version KMB 0.53, you'll see that using the FPU becomes totally obsolete if you don't need extended precision. So even if you have a calculation that can't be vectorized, you are probably still faster using SSE.

And with SSE4 (Penryn) and SSE5 (AMD, somewhere in 2009) things get even better. Some examples that I can directly think of being beneficial for the algorithm here:
SSE4: PTEST (like 'TEST' for SSE instructions)
SSE5: FMADD (fused multiply-add, finally!)


Sorry if I didn't sound rhetorical enough Smile but I didn't expect you to answer these questions Razz
Anyway, thank you for doing that and I agree totally!

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 16 Dec 2007, 12:26


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.