flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Most P4 BIOSes let you turn off HyperThreading - if not, you could always install a non-SMP ntoskrnl.exe, but that's not much fun. Or perhaps duplicate your boot.ini entry and add /ONECPU (or /NUMPROC=1).
Post 10 May 2006, 15:10
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
New results seem to confirm the effect of Hyper-Threading:
http://www.mikusite.de/pages/x86.htm
But strangely enough, it doesn't carry over to the Intel dual cores. One guy tested a Pentium EE with dual core and Hyper-Threading... same result per MHz as a Pentium D without HT... really interesting, but I have no explanation for it...
Post 10 May 2006, 17:44
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
BTW, the guys who get bad results with multiple cores should make sure that
ALL their CORES AND THREADS actually spin up and don't sit in some power-saving state.
They should leave Task Manager running during the first run and check whether both threads
(4 threads on the EE) really reach at least 95% each. If they do, then run
the test again with nothing running in the background - I mean NOTHING:
you can shut down many services, disable your LAN and shut down the
antivirus (I think you know why disabling the LAN is necessary ;) ).

...on to my post...

What I've read about HT is that it only doubles the ALU and nothing else. I haven't heard anything about double-integer :|

The FPU is also an ALU, but I think you can draw a line there: the FPU would have to do multiple multiplies or divisions at once, and it can only do one math operation per clock.

Sure, I can test it with one thread; that is what "affinity" is for. You can set it in Task Manager, but I'm not sure I can find a way to *start* a process with a given affinity. I know for sure that I can change it dynamically at runtime. Maybe there's an API for starting processes with a default or a specific affinity.

I thought I'd PM you, Kuemmel, but I figured that others would like to know my results too. So...
Score on my http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-697.png
was: 43.979

The problem now is that I'm at home and can't test the Prescott :P only a P4 2.66C (the RIMM machine)
...and the results are in:
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-2651.png
Code:
FPU :  98.070
SSE2: 234.093
    


:D but maybe you're interested in a LOW... I mean really LOW-end CPU. Do you
remember the Pentium II ;)
http://enos.itcollege.ee/~mkalme/PAHN/Up/cpu-350.png
22.166

I can see the relationship here: 350*2 = 700 MHz and 22.166*2 = 44.332, which is
only slightly higher per MHz than the 43.979 above. If you look closely, it has a half-pumped FSB compared to my P!!!, because both run at a 100MHz FSB. Another fact is that the Pentium II has 2x more L2 cache.
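The per-MHz comparison can be checked with a few lines of Python. This is only a sketch of the arithmetic: the scores are the ones quoted in this post, and the 697 MHz clock for the P!!! is an assumption read off the screenshot file name (cpu-697.png), not a measured value.

```python
# Scores quoted in this post, paired with the CPU clock in MHz.
# Assumption: 697 MHz for the P!!! is taken from the file name cpu-697.png.
results = {
    "Pentium II": (22.166, 350),
    "P!!!": (43.979, 697),
}


def score_per_mhz(score, mhz):
    """Benchmark score normalised by clock speed."""
    return score / mhz


per_mhz = {cpu: score_per_mhz(score, mhz) for cpu, (score, mhz) in results.items()}

for cpu, value in per_mhz.items():
    print(f"{cpu}: {value:.4f} per MHz")
```

Under that assumption the Pentium II comes out marginally ahead per MHz, which matches the observation above.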

My P4 at home has the same amount of L2 cache as the Pentium II, but it has
less L1 data cache (code is about the same: 12K micro-ops >= 16KB).

EDIT:
Okay, now I tested it both ways on my Prescott at work:
Code:
With 1 and 2 threads respectively:
SSE2: 222.666 / 323.726
FPU :  95.316 / 174.792
    
Post 10 May 2006, 19:49
kuscsikp



Joined: 07 May 2006
Posts: 19
kuscsikp
Hi people!
http://board.flatassembler.net/topic.php?t=5232
It is another CPU benchmark. /works in Linux, Windows and DOS too/
It is not finished yet... /I am working hard on it/
If someone wants to help me, please send me some results!
;)
Post 11 May 2006, 10:27
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Madis731 wrote:
EDIT:
Okay, now I tested it both ways on my Prescott at work:
Code:
With 1 and 2 threads respectively:
SSE2: 222.666 / 323.726
FPU :  95.316 / 174.792
    

...first, thanks for all the testing! The PII result again shows that they didn't change much in the FPU itself compared to the PIII or P-M; only the P4 sucks without HT and is still worse even with HT... but the clock rate matters...

EDIT: When I look at the factor between the PII and the Intel Core Duo... the basic law that processor power doubles every two years is more than fulfilled... 8 years between them... 2*2*2*2=16... here the factor is >19 ;)
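The doubling arithmetic, written out as a small sketch. The two-year doubling period and the 8-year gap are the figures from this post; the function name is made up for illustration.

```python
def expected_factor(years, doubling_period_years=2):
    """Growth factor if performance doubles once per doubling period."""
    return 2 ** (years / doubling_period_years)


# 8 years at one doubling every 2 years gives 4 doublings.
print(expected_factor(8))
```

A measured factor above 19 is therefore ahead of the factor of 16 that this rule of thumb predicts.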

What's still strange is that the HT effect doesn't show up with the dual core (as you can see from a result on my home page)... maybe HT can't deal with that... I must say all the results from the AMD architecture are far more consistent... it looks kind of hard to optimize for Intel, or in a common way for all... that's one more reason for me to look forward to the Conroe architecture... in July we'll know more...

@kuscsikp: I'll give it a run when I'm back home!
Post 11 May 2006, 14:47
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Hi guys,

I'm still discovering new things ;) ->

Since I made the variables local, their order seems to matter. I found this out when I had to put a new local variable inside the proc specification. The difference wasn't too big (a result going from 158,xxx to 163,xxx) but noticeable for the SSE2 code... so there's a problem either with the actual position of the variable (cache or whatever) or with the alignment...

Can I align local variables somehow in the proc?
I tried:
proc ...uses ebi ..., data:dq,
locals
align 16
data dq
...
endl

...but it didn't work... error...
Post 11 May 2006, 21:45
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Hm, I don't know if FASM can align local variables - so we should keep QWORD variables first, grouped together. And we also need to do some align-by-16 to ebp/esp.
Post 11 May 2006, 22:02
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
f0dder wrote:
Hm, I don't know if FASM can align local variables - so we should keep QWORD variables first, grouped together. And we also need to do some align-by-16 to ebp/esp.


Hm, how can I achieve this? I see from a hex editor that the local variables are addressed like [ebp - 0xx], but where or what is ebp, and how do I modify the location?
Post 12 May 2006, 20:36
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
It should be possible by manipulating the prologue code... currently the prologue is something like:

Code:
.code:00401480  push    ebp
.code:00401481  mov     ebp, esp
.code:00401483  sub     esp, 4Ch
    


Instead of this, ESP should be adjusted enough that there's room for both the locals and any necessary alignment padding. Then, "mov ebp, esp" should be changed into code that makes sure EBP is aligned to 16.

Of course this poses some problems wrt. accessing function variables... *mumble*
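To make the address arithmetic concrete, here is a small Python model of it. It is not FASM code and not the actual prologue; it only simulates what rounding the stack pointer down to a 16-byte boundary (the effect of an instruction like "and esp, -16") does to the numbers. The entry ESP value in the usage lines is an arbitrary example, and the function name is made up.

```python
def aligned_locals_base(esp_at_entry, locals_size, alignment=16):
    """Model of a prologue that aligns the locals area.

    Returns (new_esp, padding): the aligned address where the locals
    start, and how many extra bytes the rounding consumed.
    """
    esp_after_push = esp_at_entry - 4           # push ebp
    unaligned = esp_after_push - locals_size    # what a plain 'sub esp, N' gives
    new_esp = unaligned & ~(alignment - 1)      # round down to the boundary
    return new_esp, unaligned - new_esp


# Hypothetical entry ESP; 4Ch bytes of locals as in the listing above.
new_esp, padding = aligned_locals_base(0x0012FF7C, 0x4C)
print(hex(new_esp), padding)
```

The padding depends on the value ESP happens to have at entry, so it differs from call to call. That is the problem hinted at above: the arguments sit at fixed offsets from the original stack position, not from the aligned one.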
Post 13 May 2006, 10:42
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
I think I'll make a post in the main thread... it should be interesting for non-fractal people, too ;)
Post 13 May 2006, 11:50
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
...or the ideas&projects section
Post 13 May 2006, 11:50
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
I finally made a new release at
http://www.mikusite.de/pages/x86.htm

It's now version 0.51 MT and includes some things found during the evaluation.

- A possible memory violation caused by the threads accessing the same variable is fixed; the variable is now local to each thread.

- Regarding the alignment, I didn't find a general solution. I just looked at general optimization documents and sorted the local variables in what seems to be the optimum order: first the qwords, then the dwords, adding dummies to each group to fill it up so that both the dwords and the qwords are aligned to 16. It seems to work okay, and I gained about 1-2 percent performance for the SSE2 version.

- The benchmark is now also repeated 5 times to get more stable results; it was just over too fast on the fast dual-core CPUs ;)
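The sorting-and-padding idea from the alignment item can be sketched as a small offset calculator. This is a Python model, not the benchmark's source: the variable names are hypothetical, and the offsets are relative to the start of the locals area, which is assumed (not guaranteed, as discussed earlier in the thread) to sit on a 16-byte boundary.

```python
def layout_locals(variables, group_align=16):
    """Place locals largest-first and pad each size group to group_align.

    variables: list of (name, size_in_bytes).
    Returns (layout, total_size); layout holds (name, offset, size)
    entries, including the dummy padding entries.
    """
    layout = []
    offset = 0
    for size in sorted({s for _, s in variables}, reverse=True):
        for name, s in variables:
            if s == size:
                layout.append((name, offset, s))
                offset += s
        pad = -offset % group_align          # bytes missing to the next boundary
        if pad:
            layout.append((f"dummy_{size}", offset, pad))
            offset += pad
    return layout, offset


# Hypothetical locals: three qwords and two dwords, declared in mixed order.
layout, total = layout_locals([("a", 4), ("x", 8), ("b", 4), ("y", 8), ("z", 8)])
for name, off, size in layout:
    print(f"{name:8s} offset {off:2d} size {size}")
```

With this ordering every qword lands on an 8-byte multiple and each group starts on a 16-byte multiple, at the cost of a few dummy bytes.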

If somebody is still in the mood to test ;), post me the results. You can also do it by PM, and I'll update the results table on the webpage.
Post 14 May 2006, 21:12
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
FPU: 131; SSE2: 179
Athlon 64 3200+ (2.0 GHz Venice core, Socket 939), 1GB DDR400 Dual-Channel 3.0-3-3-8

PS: Single core of course
Post 15 May 2006, 00:32
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Hi guys,

a Taiwanese overclocker got hold of the upcoming mobile CPU from Intel, the Core 2, internally called 'Merom', clocked at 3200 MHz. Check this out:

32 bit - Win
FPU: 406,195
SSE2: 888,284 >:)

64 bit - Win
FPU: 404,612
SSE2: 893,381 >:)

I can only say: WOW!!! The SSE2 performance per MHz is improved by about 60%!!! The FPU performance is the same as the first Core architecture.
Post 15 May 2006, 05:57
UCM



Joined: 25 Feb 2005
Posts: 285
Location: Canada
UCM
SSE2:392,393.110,393.110
FPU:286.232,285.969,283.948

Athlon X2 4200+ 2.2 Ghz dual-core
Post 15 May 2006, 21:39
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Hi guys,

I'm still discovering new things; Hyper-Threading especially keeps me busy. In all the previous versions the multithreading was limited to 2 threads assigned to 2 cores. Now I made a version using 4 threads assigned to 4 cores. Look what happened on the

Intel Pentium D 965 EE Presler (Hyper Threading and Dual Core, 3733 MHz)
2 Thread version result: 235,674 (FPU) 549,834 (SSE2)
4 Thread version result: 333,172 (FPU) 671,360 (SSE2)

...so the Hyper-Threading benefit visible in the single-core results also shows up on the dual core if 4 threads are set up instead of 2.
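For reference, the gain from going from 2 to 4 threads can be put into numbers. The sketch below only divides the scores quoted above; the names are made up for illustration.

```python
# (2-thread score, 4-thread score) for the Presler EE results quoted above.
runs = {
    "FPU": (235.674, 333.172),
    "SSE2": (549.834, 671.360),
}


def speedup(two_threads, four_threads):
    """Ratio of the 4-thread score to the 2-thread score."""
    return four_threads / two_threads


gains = {name: speedup(*scores) for name, scores in runs.items()}

for name, gain in gains.items():
    print(f"{name}: x{gain:.3f}")
```

Both ratios are well below 2, so the two extra threads help, but far less than two extra cores would.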

This brings me to the conclusion that my benchmark should somehow detect 1) how many cores are available and 2) whether Hyper-Threading is available, and then set up as many threads as are useful.

How can these two things be detected?

And this brings me to the next question... are there any theories about Hyper-Threading and how many threads can be useful?

Could the guys with a single-core P4 with Hyper-Threading test the 4-thread version again to see if it has any positive effect?

The link is:
http://www.mikusite.de/x86/KMB_V_0.52_4_test.zip
Post 06 Jun 2006, 21:10
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
Extreme edition has actually 4 threads - 2 cores and each core has 2 threads. You should try with 8 threads and see if there would be any benefit.
Post 07 Jun 2006, 06:08
Kuemmel



Joined: 30 Jan 2006
Posts: 198
Location: Stuttgart, Germany
Kuemmel
Madis731 wrote:
Extreme edition has actually 4 threads - 2 cores and each core has 2 threads. You should try with 8 threads and see if there would be any benefit.

...okay, the problem is that I can't test that system any more... so can Hyper-Threading cope with just 2 threads, or even more?

EDIT: Checked the Intel page: "HT Technology allows a single Pentium 4 processor to function as two virtual or logical processors. There's still just one physical Pentium 4 processor in your PC - but the processor can execute two threads simultaneously"... so it seems that it's limited, but maybe you can prove it ;)

Could you also try the 4-thread version on your single core with Hyper-Threading? I thought you had a system like this. Cheers & thanks!
Post 07 Jun 2006, 15:38
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
HyperThreading is (or should be, anyway :-s) more limited than SMP or multicore machines, since the logical processors share physical execution units...

I'm not sure if there's some decent way to get CPU count, but perhaps the environment variable NUMBER_OF_PROCESSORS is the thing to read?
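As a rough illustration of the environment-variable idea, here is a Python sketch (the benchmark itself is assembly, so this only shows the logic). NUMBER_OF_PROCESSORS is set by Windows and counts logical processors, so on its own it does not tell Hyper-Threading apart from real cores. The function name and the fallback to os.cpu_count() are assumptions made for this example.

```python
import os


def logical_cpu_count(environ=None):
    """Number of logical processors, preferring NUMBER_OF_PROCESSORS.

    environ is injectable for testing; falls back to os.cpu_count()
    when the variable is missing or not a positive integer.
    """
    env = os.environ if environ is None else environ
    raw = env.get("NUMBER_OF_PROCESSORS", "")
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return os.cpu_count() or 1


print(logical_cpu_count())
```

A benchmark could start one worker thread per logical processor reported this way; telling HT from real cores would need something more, such as CPUID.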
Post 07 Jun 2006, 17:15
Madis731



Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Madis731
Kuemmel - did you read the post I made on 10 May 2006, 19:49 closely?

At the very end I ran a Prescott test with 1 and 2 threads. I did it by changing boot.ini, adding a startup option that makes it use only one part of the CPU.

I'm sorry, but I can't really understand what you don't understand or need :(

Hyper-Threading gives you 2 virtual processors, but it only means 2x ALU.
Dual-Core gives you 2 full processors, and the effectiveness is better.
Older EEs just had more cache, but today they have BOTH HT & DC - this actually means there are 2 processors in one chip, and each of them has 2 threads running on different ALUs. This is why I said 4 threads.

My Prescott is only 2-threaded, but according to some tests by Intel, some applications can perform well with 150-160 threads. I don't know the logic behind it, but it seems that the syncing is better with that many threads. I think even one CPU with one hardware thread can get a lot of help from multiple threads.


EDIT: I think I understand now :D, but I'll have to get to work for that. In a few days I'll get my hands on my new pet and I'll do my testing solely on that.
Post 07 Jun 2006, 17:30
Copyright © 1999-2020, Tomasz Grysztar.