flat assembler
Message board for the users of flat assembler.

Index > Windows > Mandelbrot Benchmark FPU/SSE2 released

Goto page Previous  1, 2, 3 ... 6, 7, 8 ... 18, 19, 20  Next
Author
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Xorpd! wrote:
I agree that the Xeon results look suspect -- I would guess that he doesn't have one of the cores enabled on the x64 tests. It seems this is a common problem. If you could get DaveB to fix that somehow and run all 10 or 11 benchmarks again, his system should end up seriously smoking everything we have seen to date...

Hm, he claims that everything is okay. The fact that KMB V0.53 scaled normally on his machine points that way, and according to him Cinebench also ran normally...hard to say what's wrong...maybe some instruction cache limit issue!? Or is the inner loop of your versions still too short for that to be the problem?
Post 05 Nov 2007, 19:20
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Yeah, but I think the problem may be in how his x64 OS is set up. Clearly he has all 8 cores enabled when he boots to 32 bits, as can be seen from his results with MB_V0.53_MT_FPU.exe and MB_V0.53_MT_SSE2.exe.

This gives us a good opportunity, if DaveB is willing, to test my hypothesis: have him run those two 32-bit benchmarks while booted to x64. If my hypothesis is correct, he should get performances of about 494 and 1060 million iterations per second, whereas if all 8 cores are enabled, he should get about 988 and 2120 million iterations per second. This is what benchmarks are supposed to be good for, after all: to determine whether there are problems with your system by measuring performance and comparing with expected performance.
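The arithmetic behind this check is simple linear scaling; a minimal Python sketch of the assumption (illustration only, using the figures quoted above):

```python
# If per-core throughput is fixed, total throughput scales roughly
# linearly with the number of enabled cores -- so a run at half the
# expected rate suggests half the cores are active.
def expected_rate(per_core_rate, cores):
    """Ideal linear scaling: total million iterations/sec."""
    return per_core_rate * cores

# ~494 Miter/s observed on 4 cores implies ~988 Miter/s with all 8,
# and ~1060 implies ~2120, matching the numbers in the post.
assert expected_rate(494.0 / 4, 8) == 988.0
assert expected_rate(1060.0 / 4, 8) == 2120.0
```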

I checked my 64-bit version with dumpbin:
Code:
  000000000040164E: BA 01 00 00 00     mov         edx,1
  0000000000401653: 48 8B 0D 7E 60 00  mov         rcx,qword ptr [004076D8h]
                    00
  000000000040165A: E8 C5 39 00 00     call        0000000000405024
  000000000040165F: BA 02 00 00 00     mov         edx,2
  0000000000401664: 48 8B 0D 75 60 00  mov         rcx,qword ptr [004076E0h]
                    00
  000000000040166B: E8 B4 39 00 00     call        0000000000405024
  0000000000401670: BA 04 00 00 00     mov         edx,4
  0000000000401675: 48 8B 0D 6C 60 00  mov         rcx,qword ptr [004076E8h]
                    00
  000000000040167C: E8 A3 39 00 00     call        0000000000405024
  0000000000401681: BA 08 00 00 00     mov         edx,8
  0000000000401686: 48 8B 0D 63 60 00  mov         rcx,qword ptr [004076F0h]
                    00
  000000000040168D: E8 92 39 00 00     call        0000000000405024
  0000000000401692: BA 10 00 00 00     mov         edx,10h
  0000000000401697: 48 8B 0D 5A 60 00  mov         rcx,qword ptr [004076F8h]
                    00
  000000000040169E: E8 81 39 00 00     call        0000000000405024
  00000000004016A3: BA 20 00 00 00     mov         edx,20h
  00000000004016A8: 48 8B 0D 51 60 00  mov         rcx,qword ptr [00407700h]
                    00
  00000000004016AF: E8 70 39 00 00     call        0000000000405024
  00000000004016B4: BA 40 00 00 00     mov         edx,40h
  00000000004016B9: 48 8B 0D 48 60 00  mov         rcx,qword ptr [00407708h]
                    00
  00000000004016C0: E8 5F 39 00 00     call        0000000000405024
  00000000004016C5: BA 80 00 00 00     mov         edx,80h
  00000000004016CA: 48 8B 0D 3F 60 00  mov         rcx,qword ptr [00407710h]
                    00
  00000000004016D1: E8 4E 39 00 00     call        0000000000405024
  00000000004016D6: BA 00 01 00 00     mov         edx,100h
  00000000004016DB: 48 8B 0D 36 60 00  mov         rcx,qword ptr [00407718h]
                    00
  00000000004016E2: E8 3D 39 00 00     call        0000000000405024
  00000000004016E7: BA 00 02 00 00     mov         edx,200h
  00000000004016EC: 48 8B 0D 2D 60 00  mov         rcx,qword ptr [00407720h]
                    00
  00000000004016F3: E8 2C 39 00 00     call        0000000000405024
  00000000004016F8: BA 00 04 00 00     mov         edx,400h
  00000000004016FD: 48 8B 0D 24 60 00  mov         rcx,qword ptr [00407728h]
                    00
  0000000000401704: E8 1B 39 00 00     call        0000000000405024
  0000000000401709: BA 00 08 00 00     mov         edx,800h
  000000000040170E: 48 8B 0D 1B 60 00  mov         rcx,qword ptr [00407730h]
                    00
  0000000000401715: E8 0A 39 00 00     call        0000000000405024
  000000000040171A: BA 00 10 00 00     mov         edx,1000h
  000000000040171F: 48 8B 0D 12 60 00  mov         rcx,qword ptr [00407738h]
                    00
  0000000000401726: E8 F9 38 00 00     call        0000000000405024
  000000000040172B: BA 00 20 00 00     mov         edx,2000h
  0000000000401730: 48 8B 0D 09 60 00  mov         rcx,qword ptr [00407740h]
                    00
  0000000000401737: E8 E8 38 00 00     call        0000000000405024
  000000000040173C: BA 00 40 00 00     mov         edx,4000h
  0000000000401741: 48 8B 0D 00 60 00  mov         rcx,qword ptr [00407748h]
                    00
  0000000000401748: E8 D7 38 00 00     call        0000000000405024
  000000000040174D: BA 00 80 00 00     mov         edx,8000h
  0000000000401752: 48 8B 0D F7 5F 00  mov         rcx,qword ptr [00407750h]
                    00
  0000000000401759: E8 C6 38 00 00     call        0000000000405024
    

So I didn't do anything funny like set MAXTHREADS to 8 when I assembled the executable that's on my website. I replaced the original multithreading code (due to f0dder?) with macros controlled by a single constant, so something like that was a real possibility. Another way to check if there are 16 threads running is simply to look at the screen during rendering: with more threads than cores, lines appear and disappear across the images as threads are interrupted and resume. With at least as many cores as threads the images are rendered smoothly without this effect.
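The pattern in the dumpbin listing above can also be checked mechanically: each call site loads a distinct one-bit constant (1, 2, 4, ..., 8000h) into edx before the same call target, i.e. one mask per thread. A small Python sketch (reading edx as a per-thread mask is my interpretation of the listing, not something the dump itself states):

```python
# The disassembly shows 16 call sites, each passing a one-bit constant:
# 1, 2, 4, ..., 8000h.  Generate the same sequence and check it.
masks = [1 << i for i in range(16)]

assert masks[0] == 0x1
assert masks[-1] == 0x8000
assert len(set(masks)) == 16                        # 16 distinct masks -> 16 threads
assert all(bin(m).count("1") == 1 for m in masks)   # exactly one bit each
```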

Also I don't know what the problem is with Rick5127. I get a similar problem when my monitor is set in portrait mode, but I do get the popup at the end which tells me the performance. This is the case for both the 32-bit and the 64-bit versions. Another possibility is that his system is fast enough that the benchmark is over before it has switched video modes successfully. One way to check for this is to run MB_V0.53_MT_FPU.exe because it may be slow enough that it could give his monitor/video card time to make the switch even if the others aren't. Also we could see if it's just a problem with my code because the 32-bit version is more thoroughly tested. That is also a possibility because he should have seen the popup if the program wasn't simply scrogged by the OS without notification.

Well, thanks again to all concerned and keep in touch.
Post 05 Nov 2007, 22:36
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
First AMD Phenom Results !!!

It looks like AMD can regain some speed regarding SSE2; still, Intel takes the lead at the same clock speed, and Intel's chips can simply clock higher...

I updated the table on my webpage and also made one for Xorpd!'s version, to give an overview of the speed per MHz and per core on the same page below. The E5310 result still isn't confirmed, as there were problems with enabled cores, but it fits the other Core 2 Duo results.

Here are the pure AMD Phenom Results I got:

KMB V0.53 MT:
AMD Phenom 9500 2500 MHz, 4 Cores:
FPU: 635.082
SSE2: 1260.500

Xorpd!s x64 version:
AMD Phenom 9500 2500 MHz, 4 Cores:
KMB0.57 2T_X1 - 623.935
KMB0.57 2T_X2 - 1142.545
KMB0.57 2T_X3 - 627.709
KMB0.57 2T_X4 - 1541.304
KMB0.57 MT_X1 - 1177.102
KMB0.57 MT_X2 - 2072.169
KMB0.57 MT_X3 - 1181.569
KMB0.57 MT_X4 - 2725.107

AMD Phenom 9500 1666 MHz, 4 Cores:
KMB0.57 2T_X1 - 416.235
KMB0.57 2T_X2 - 761.231
KMB0.57 2T_X3 - 418.472
KMB0.57 2T_X4 - 1029.234
KMB0.57 MT_X1 - 789.210
KMB0.57 MT_X2 - 1393.035
KMB0.57 MT_X3 - 799.341
KMB0.57 MT_X4 - 1834.129

Everything is in line, but what's really strange are the results for the X3 version. Any clue, Xorpd!?
Post 08 Dec 2007, 14:16
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
AMD Athlon 64 X2 5400+ @ 2.8 GHz, some background apps running

FPU - 344.698
SSE2 - 472.987

Binary used from first post downloaded today.

Formatted according to the included XLS file:
AMD Athlon 64 X2 5400+ 2800 2 344.698 61.6 472.987 84.5 1.37
Post 08 Dec 2007, 21:56
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
From what I've seen so far, AMD is heading for disaster with their latest chips :/
Post 08 Dec 2007, 23:29
FrozenKnight



Joined: 24 Jun 2005
Posts: 128
I should probably remind you that that was in 32-bit mode, on a 64-bit chip (I don't have a 64-bit version of Windows yet).

And Intel may be getting better, but they still cost twice as much for about the same performance level.
Post 09 Dec 2007, 11:18
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
Oh really?

Phenom Quad 9600 (2.3GHz): DKK 1.890
Core 2 Quad Q6600 (2.4GHz): DKK 1.728

...and from the results I've seen, the recent AMD chips are a bit slower per MHz, and AMD seems to have trouble reaching high enough frequencies to even compete with Intel's top line...

If they had lower power consumption then that would be something, but it does look pretty bleak for AMD right now. Which is sad, really; I prefer some competition in the CPU market.
Post 09 Dec 2007, 11:41
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
The thread seems to be trashed, and I'll add some more :) Until AMD fixes the TLB problem in hardware (not just by turning the logic off in their firmware), they will remain only 2nd in the market.
Post 09 Dec 2007, 11:55
Xorpd!



Joined: 21 Dec 2006
Posts: 161
The phenom results are very interesting. There is one obvious difference with the X3 code that I can recall. In this archive I have a test to see whether the problem is with the extra memory moves in that version: old_X3.exe is the original and does 1645.144 million iterations per second on my PC, whereas test_X3 is modified to use memory moves and only does 1602.798 million iterations per second. The latter version may run at normal rates on the Phenom, however, and I would very much like to see the results of these two tests on that processor if possible.
Post 11 Dec 2007, 21:22
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
I cannot wait for Agner to update the Penryn/Phenom part of his http://agner.org/optimize Instruction tables...

Does anyone know where he gets his CPUs from: donations or some other way?
Post 12 Dec 2007, 08:14
rugxulo



Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
f0dder wrote:
Oh really?

Phenom Quad 9600 (2.3GHz): DKK 1.890
Core 2 Quad Q6600 (2.4GHz): DKK 1.728

...and from the results I've seen, the recent AMD chips are a bit slower per MHz, and AMD seems to have trouble reaching high enough frequencies to even compete with Intel's top line...

If they had lower power consumption then that would be something, but it does look pretty bleak for AMD right now. Which is sad, really; I prefer some competition in the CPU market.


(further reading here):

Post 12 Dec 2007, 13:06
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
C7 might have low power consumption, but it also has pretty low performance :/ - and from sporadic googling, it sounds like there are all sorts of chipset bugs/deficiencies as well. Too bad; the Padlock (AES/Rijndael hardware) sounds cute.
Post 12 Dec 2007, 13:15
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
@Xorpd!: I passed the link to the guy with the Phenom machine, hoping to get some results soon!

@Madis731: I think if you are a 'relevant' member of the overclocking scene or of some online magazines, then you'll get engineering samples after a while; also, somehow there are lots of engineering samples on eBay in Asia, as I read in some overclocker forums... Intel or AMD might even have an interest, if the CPUs are good, in spreading them around somehow to create some 'hype' on the web...
Post 12 Dec 2007, 19:18
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Xorpd! wrote:
The phenom results are very interesting. There is one obvious difference with the X3 code that I can recall. In this archive I have a test to see whether the problem is with the extra memory moves in that version: old_X3.exe is the original and does 1645.144 million iterations per second on my PC, whereas test_X3 is modified to use memory moves and only does 1602.798 million iterations per second. The latter version may run at normal rates on the Phenom, however, and I would very much like to see the results of these two tests on that processor if possible.

Hi Xorpd!,

I got the results:

Phenom 2497MHz
KMB0.57 MT_X3 - 1172.668
KMB0.58 MT_X3 - 2418.201

So it seems you solved it with the new version! Can you go into a little more detail on what the limit of the old one was, and why a Core 2 Duo apparently doesn't care while a Phenom does!?
Post 15 Dec 2007, 22:18
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
@Kuemmel: I think 'relevant' was in order Smile because I haven't actually overclocked a CPU in my life (or anything else, for that matter)...
...BUT (there's a big BUT) I am a very eager person to get the newest (especially fastest) stuff on the market.

That's for a few things:
- Why do you think optimisations are so important?
- Why do you think new technologies are so important (SSEx)
- Smile I am watching this thread, aren't I, so when I get a Penryn I will tell you the results Very Happy
Post 15 Dec 2007, 23:54
Xorpd!



Joined: 21 Dec 2006
Posts: 161
@Kümmel: For reference, let me post the inner loop:
Code:
.iteration_entry1:
macro single_step
{
if TESTME = 1
   movaps xmm3, xmm1           ; xmm3:    iz           |   iz+dz
   mulpd  xmm1, xmm8           ; xmm1:    2*iz         |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm1, xmm0           ; xmm1:    2*iz*rz      |   2*(iz+dz)*(rz+dz)
   mulpd  xmm0, xmm0           ; xmm0:    rz^2         |   (rz + dz)^2

   movaps [r11], xmm0           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm0, xmm3           ; xmm0:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm1, xmm5           ; xmm1:    2*iz*rz+iz0  |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm0, xmm4           ; xmm0:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm6, xmm3           ; Add iteration counts

   movaps xmm3, xmm10          ; xmm3:    iz           |   iz+dz
   mulpd  xmm10, xmm8          ; xmm10:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm10, xmm9          ; xmm10:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm9, xmm9           ; xmm9:    rz^2         |   (rz + dz)^2

   movaps [r11], xmm9           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm9, xmm3           ; xmm9:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm10, xmm5          ; xmm10:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm9, xmm13          ; xmm9:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm15, xmm3          ; Add iteration counts

   movaps xmm3, xmm12          ; xmm3:    iz           |   iz+dz
   mulpd  xmm12, xmm8          ; xmm12:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm12, xmm11         ; xmm12:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm11, xmm11         ; xmm11:   rz^2         |   (rz + dz)^2

   movaps [r11], xmm11          ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm11, xmm3          ; xmm11:   rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm12, xmm5          ; xmm12:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm11, [rsp+288]     ; xmm11:   rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm14, xmm3          ; Add iteration counts
else if TESTME = 2
   movaps xmm3, xmm1           ; xmm3:    iz           |   iz+dz
   mulpd  xmm1, xmm8           ; xmm1:    2*iz         |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm1, xmm0           ; xmm1:    2*iz*rz      |   2*(iz+dz)*(rz+dz)
   mulpd  xmm0, xmm0           ; xmm0:    rz^2         |   (rz + dz)^2

   movaps xmm2, xmm0           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm0, xmm3           ; xmm0:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm1, xmm5           ; xmm1:    2*iz*rz+iz0  |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, xmm2           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm0, xmm4           ; xmm0:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm6, xmm3           ; Add iteration counts

   movaps xmm3, xmm10          ; xmm3:    iz           |   iz+dz
   mulpd  xmm10, xmm8          ; xmm10:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm10, xmm9          ; xmm10:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm9, xmm9           ; xmm9:    rz^2         |   (rz + dz)^2

   movaps xmm2, xmm9           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm9, xmm3           ; xmm9:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm10, xmm5          ; xmm10:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, xmm2           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm9, xmm13          ; xmm9:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm15, xmm3          ; Add iteration counts

   movaps xmm3, xmm12          ; xmm3:    iz           |   iz+dz
   mulpd  xmm12, xmm8          ; xmm12:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm12, xmm11         ; xmm12:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm11, xmm11         ; xmm11:   rz^2         |   (rz + dz)^2

   movaps xmm2, xmm11          ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm11, xmm3          ; xmm11:   rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm12, xmm5          ; xmm12:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, xmm2           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm11, [rsp+288]     ; xmm11:   rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm14, xmm3          ; Add iteration counts
else
   movaps xmm3, xmm1           ; xmm3:    iz           |   iz+dz
   mulpd  xmm1, xmm8           ; xmm1:    2*iz         |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm1, xmm0           ; xmm1:    2*iz*rz      |   2*(iz+dz)*(rz+dz)
   mulpd  xmm0, xmm0           ; xmm0:    rz^2         |   (rz + dz)^2

   movaps [r11], xmm0           ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm0, xmm3           ; xmm0:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm1, xmm5           ; xmm1:    2*iz*rz+iz0  |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11]           ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm0, xmm4           ; xmm0:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm6, xmm3           ; Add iteration counts

   movaps xmm3, xmm10          ; xmm3:    iz           |   iz+dz
   mulpd  xmm10, xmm8          ; xmm10:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm10, xmm9          ; xmm10:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm9, xmm9           ; xmm9:    rz^2         |   (rz + dz)^2

   movaps [r11+16], xmm9       ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm9, xmm3           ; xmm9:    rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm10, xmm5          ; xmm10:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11+16]       ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm9, xmm13          ; xmm9:    rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm15, xmm3          ; Add iteration counts

   movaps xmm3, xmm12          ; xmm3:    iz           |   iz+dz
   mulpd  xmm12, xmm8          ; xmm12:    2*iz        |   2*(iz+dz)
   mulpd  xmm3, xmm3           ; xmm3:    iz^2         |   (iz+dz)^2
   mulpd  xmm12, xmm11         ; xmm12:    2*iz*rz     |   2*(iz+dz)*(rz+dz)
   mulpd  xmm11, xmm11         ; xmm11:   rz^2         |   (rz + dz)^2

   movaps [r11+32], xmm11      ; xmm2:    rz^2         |   (rz+dz)^2
   subpd  xmm11, xmm3          ; xmm11:   rz^2-iz^2    |   (rz+dz)^2-(iz+dz)^2
   addpd  xmm12, xmm5          ; xmm12:    2*iz*rz+iz0 |   2*(iz+dz)*(rz+dz)+iz0+dz
   addpd  xmm3, [r11+32]       ; xmm2:    rz^2+iz^2    |   (rz+dz)^2+(iz+dz)^2
   cmplepd xmm3, xmm7          ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
   addpd  xmm11, [rsp+288]     ; xmm11:   rz^2-iz^2+rz0|   (rz+dz)^2-(iz+dz)^2+rz0+dz

   psubd  xmm14, xmm3          ; Add iteration counts
end if
}
repeat 3
   single_step
end repeat
    

The second possibility, TESTME = 2, is the most obvious choice. (Re z)² has been calculated and we need both (Re z)²-(Im z)² to get the next Re z and also (Re z)²+(Im z)² so that we may test |z|² to see whether we've diverged yet. The first thing you would think of is saving it, that is (Re z)², in another register (xmm2 in the actual code) and calculating the next Re z, destroying the current register. Later, we restore the value of (Re z)² from the register we saved it in and use it to compute |z|².
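For readers less used to the SIMD form: the scalar recurrence those register comments describe is the usual Mandelbrot step, sketched here in Python (illustration only, not the benchmark code):

```python
def mandel_iters(rc, ic, max_iter=256):
    """Iterate z -> z^2 + c, counting steps until |z|^2 > 4 (divergence)."""
    rz, iz = 0.0, 0.0
    for n in range(max_iter):
        rz2, iz2 = rz * rz, iz * iz        # (Re z)^2 and (Im z)^2
        if rz2 + iz2 > 4.0:                # the cmplepd test in the loop above
            return n
        # rz2 feeds both the next Re z and the divergence test --
        # exactly the value the asm saves in xmm2 or at [r11].
        rz, iz = rz2 - iz2 + rc, 2.0 * rz * iz + ic
    return max_iter

assert mandel_iters(0.0, 0.0) == 256   # the origin never diverges
assert mandel_iters(2.0, 2.0) == 1     # far outside, escapes immediately
```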

The problem with this approach on Core 2 Duo is that movaps (and its buddies movapd and movdqa) when used to move values between two registers can execute in ports 0, 1, or 5. If it issues to port 0, it can take up a slot that could be used by a floating point multiply. If it goes to port 1, it could get in the way of a floating point add. The slotting logic is not smart enough to see this coming and sometimes it does seem to interfere with computational ports.

The solution to this problem on Core 2 Duo is to store (Re z)² in memory (at [r11] in the actual code) instead of a register (TESTME = 1). Of course this increases the latency of the save+restore sequence significantly from 3 to IIRC 5 clocks. This isn't a problem because the restored value of (Re z)² will be used in the out of band test for divergence rather than the sequential operation of computing the next value of z. As a consequence, the save and restore operations never get in the way of computational progress and the algorithm runs > 2% faster on Core 2 Duo.

This minor optimization wasn't undertaken for the other (i.e. X1, X2, and X4) tests so there wasn't any problem with them on phenom. But it seems that phenom can't use a memory location the way it can a register and the load from the previous loop iteration must complete or at least be well under way before the next store can be issued to it. Considering the results we have gathered I would say that Core 2 Duo, Pentium D Presler, and even Athlon 64 class processors can handle operations to a memory location out of order, but the phenom can only do so when it's a register, not memory.

As usual, an optimizer is never happy to leave things the way they are, and even though going back to a register temporary is twice as fast on phenom, it's a little slower on Core 2 Duo. The solution may be to store to a different memory location for each instruction stream (TESTME = 3). This is just as fast on Core 2 Duo as reusing the same memory location.
Code:
temporary  TESTME  executable    M iter./s
[r11]      1       testa_X3.exe  1645.144
xmm2       2       testb_X3.exe  1611.091
[r11+x]    3       testc_X3.exe  1645.144
    

This ZIP archive contains the two old and the one new version of the code. My hope is that phenom will be happy with testc_X3.exe because it uses different memory locations for different instruction streams. If your phenom owner is as curious as I about performance issues regarding this new processor, or at least has patience with us, you may be able to prevail on him to perform the tests with the new programs.

@Madis731: It's not necessary to have a Penryn to test KMB_V0.57_MT: all you need is x64 Windows. I am always looking for results from processors that aren't in my table.
Post 16 Dec 2007, 03:52
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Okay, so here are my results:
(EDIT: the final update, with all the tests)

I conducted 2 tests on the same CPU type. The first was an HP nc6320
laptop with 1GB RAM and an integrated video card. The other was an Intel
BTO laptop with 2GB of (slow) RAM and a GeForce 7600 video card. The Intel
tests were run over a terminal connection. It seems that only the CPU
matters, so use whichever test you want for reference:
Code:
;The HP lap with a T7200 (Server 2003 Enterprise x64 Ed.)
;-----------------------------------------------------------------
2T:
 609,581 ;This is @2GHz x 2 then, okay Smile
 900,487
1260,500
1470,335

MT:
 603,964
 883,224
1231,824
1429,821

Kümmel:
FPU: 250.482
SSE: 547.899

QuickMAN (as instructed):
MIters/s: 532.4
    


Code:
;The Intel lap with a T7200 (Server 2003 Standard x64 Ed.)
;-----------------------------------------------------------------
2T:
 609,581
 903,099
1255,417
1470,335

MT:
 601,630
 885,757
1231,824
1429,821

Kümmel:
FPU: 251.849
SSE: 550.807

QuickMAN (as instructed):
MIters/s: 532.4
    

There's also a quadcore PC in the pack and it looks like this:
Code:
;With a Q6600 (Server 2003 Standard x64 Ed.)
2T:
 710,424 ;This is @2.4GHz x 2 ...
1049,178
1463,424
1715,391

MT:
1374,585 ; and this is @2.4GHz x 4
1945,896
2798,593
3129,080

Kümmel:
FPU: 591.637
SSE:1281.249

QuickMAN (as instructed):
MIters/s: 643.1
    


Last edited by Madis731 on 16 Dec 2007, 12:18; edited 4 times in total
Post 16 Dec 2007, 09:51
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Madis731 wrote:

- Why do you think optimisations are so important?
- Why do you think new technologies are so important (SSEx)
- Smile I am watching this thread, aren't I, so when I get a Penryn I will tell you the results Very Happy

Hi Madis731,

...I think optimisations like the ones seen here are very important because they could let people buy the slowest Core 2 Duo instead of the Extreme Edition, which would be an 800 Euro difference, if only the coders would optimise a little more Wink

...SSE is very important: when you look at the difference achieved here in contrast to the FPU version KMB 0.53, you'll see that using the FPU becomes totally obsolete if you don't need extended precision. So even if you have a calculation that can't be vectorized, you are probably still faster using SSE.

And with SSE4 (Penryn) and SSE5 (AMD, somewhere in 2009) things get even better. Some examples that I can directly think of being beneficial for the algorithm here:
SSE4: PTEST (like 'TEST' for SSE instructions)
SSE5: FMADD (fused multiply-add, finally!)
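To make the FMADD remark concrete: each multiply in the step z -> z^2 + c is followed by an add that a fused multiply-add could absorb. A Python sketch of the pairing only; the `fma` helper here is a plain stand-in and does not model the single rounding real FMADD hardware provides:

```python
# Which multiply/add pairs in one Mandelbrot step a fused multiply-add
# could combine.  fma(a, b, c) computes a*b + c in two rounded steps,
# standing in for the one-rounding hardware op.
def fma(a, b, c):
    return a * b + c

def step(rz, iz, rc, ic):
    rz_next = fma(rz, rz, fma(-iz, iz, rc))  # rz^2 - iz^2 + rc: two FMAs
    iz_next = fma(2.0 * rz, iz, ic)          # 2*rz*iz + ic: one mul + FMA
    return rz_next, iz_next

assert step(0.0, 0.0, -0.75, 0.1) == (-0.75, 0.1)
assert step(1.0, 1.0, 0.0, 0.0) == (0.0, 2.0)
```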

@Xorpd!: I passed the link to the Phenom guy I'm in contact with; let's see what he gets. Thanks for the explanations...really interesting... As I'm more used to coding for ARM CPUs I wouldn't even have thought of using memory to store intermediates, as memory access is so slow...but then ARM already has 16 registers to use Wink

@EDIT: Got the results:
KMB0.58a MT_X3 - 1177.102
KMB0.58b MT_X3 - 2363.138
KMB0.58c MT_X3 - 2310.526
Any more conclusions from that, Xorpd!?

By the way, to get people testing the latest stuff I always try to convince some guys from http://www.xtremesystems.org/forums/ ...crazy people with liquid nitrogen pushing Core 2 Duo beyond 6 GHz. Sometimes quite interesting to read...just total hardware 'optimisations' instead of software like here.
Post 16 Dec 2007, 11:02
kohlrak



Joined: 21 Jul 2006
Posts: 1421
Location: Uncle Sam's Pad
Quote:
...I think optimisations like seen here are very important because it could make people buy the slowest Core 2 Duo instead of the Extreme Edition, what would be 800 Euro difference, if just the coders would optimise a little more


What about smoother running for other applications in a multi-tasking environment? Also, what about more enemies in a game? Take away all the bottlenecks and you've got software that can raise its own requirements. If I could ever sort out this rotation issue in OpenGL, I could probably make a small game that uses this to its advantage.
Post 16 Dec 2007, 11:59
Madis731



Joined: 25 Sep 2003
Posts: 2140
Location: Estonia
Kuemmel wrote:

...I think optimisations like the ones seen here are very important because they could let people buy the slowest Core 2 Duo instead of the Extreme Edition, which would be an 800 Euro difference, if only the coders would optimise a little more Wink

...SSE is very important: when you look at the difference achieved here in contrast to the FPU version KMB 0.53, you'll see that using the FPU becomes totally obsolete if you don't need extended precision. So even if you have a calculation that can't be vectorized, you are probably still faster using SSE.

And with SSE4 (Penryn) and SSE5 (AMD, somewhere in 2009) things get even better. Some examples that I can directly think of being beneficial for the algorithm here:
SSE4: PTEST (like 'TEST' for SSE instructions)
SSE5: FMADD (fused multiply-add, finally!)


Sorry if I didn't sound rhetorical enough Smile but I didn't expect you to answer these questions Razz
Anyway, thank you for doing that and I agree totally!

_________________
My updated idol Very Happy http://www.agner.org/optimize/
Post 16 Dec 2007, 12:26


Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.

Website powered by rwasa.