flat assembler
Message board for the users of flat assembler.

Mandelbrot Benchmark FPU/SSE2 released

tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 30 Nov 2009, 07:53
At 2.5 GHz it should be capable of at most 4*2.5 = 10 gflops per core, and should easily exceed 4.7 gflops. If you are on a single core and it popped up with 11 gflops, I have to say there is a bug in the timer. I'll try to make it more robust. BTW, I find the pop up box annoying too, is there an api function to easily draw strings to the screen?
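
The bound above is just clock rate times flops per cycle. Here is a rough sketch of that sanity check, written in Python only to keep the arithmetic readable (the benchmark itself is fasm). It assumes the 4 flops per cycle per core implied by the 4*2.5 figure; the function names are made up for illustration.

```python
def peak_gflops(clock_ghz, flops_per_cycle=4, cores=1):
    """Upper bound on throughput: cycles per second times flops per cycle."""
    return clock_ghz * flops_per_cycle * cores


def timer_looks_broken(measured_gflops, clock_ghz, cores=1):
    """A reading above the theoretical peak points at the timer, not the CPU."""
    return measured_gflops > peak_gflops(clock_ghz, cores=cores)


print(peak_gflops(2.5))               # 10.0
print(timer_looks_broken(11.0, 2.5))  # True
print(timer_looks_broken(4.7, 2.5))   # False
```

A check like this could run right after each measurement, so an impossible reading is flagged instead of displayed.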
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 30 Nov 2009, 20:22
10 years ago I remember finding http://www.toymaker.info/Games/html/text.html useful. Maybe there are better ones now.
bitshifter



Joined: 04 Dec 2007
Posts: 796
Location: Massachusetts, USA
bitshifter 30 Nov 2009, 20:33
tthsqe wrote:
BTW, I find the pop up box annoying too, is there an api function to easily draw strings to the screen?

The best way would be to dump the results into a txt file
so it's easy to copy/paste your results back into a thread.

_________________
Coding a 3D game engine with fasm is like trying to eat an elephant,
you just have to keep focused and take it one 'byte' at a time.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 02 Dec 2009, 08:57
Or don't show anything, put the results on the clipboard (through WinAPI), and let the user know somehow that you did it.

Like a shortcut key "S" to copy info to clipboard...
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 07 Dec 2009, 08:08
Madis731,
Here is a better version. On my computer the gflops value does jump around, probably due to the timer (QPC), but the flopc (flops per clock cycle) value is very stable - just use it to work out the gflops if you know your clock speed. Mine maxed out around 3.293 flopc.
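
The conversion from flopc to gflops can be sketched as follows (Python, for readability only). It assumes flopc is a per-core figure and that the total scales with the number of busy cores. The 2.5 GHz clock in the example is an assumed value, not one reported in this post.

```python
def gflops_from_flopc(flopc, clock_ghz, cores=1):
    """Convert flops per clock cycle into gflops for a known clock speed."""
    return flopc * clock_ghz * cores


# 3.293 flopc on one core at an assumed 2.5 GHz
print(round(gflops_from_flopc(3.293, 2.5), 2))
```

Because flopc does not depend on the timer's idea of wall-clock time, it is the steadier number to compare between machines.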


Attachment: MandelbrotPlot.zip (180.6 KB)

Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 07 Dec 2009, 10:38
It shows 14.5GFLOPS max (~2.9FLOP/c)

EDIT: ~3.12FLOP/c in the black area
tthsqe



Joined: 20 May 2009
Posts: 767
tthsqe 08 Dec 2009, 08:22
Thanks for trying it. Would someone with a core i7 mind trying it with HT on/off? Here is a version with a minor bug fixed (I think the code is 100% correct now).


Attachment: MandelbrotPlot.zip (180.64 KB)

LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 08 Dec 2009, 20:43
OK, in all cases: two 2 GB sticks of 800 MHz DDR2 memory at 5.0-5-5-18-24-T2 (working in unganged dual-channel mode). CPU: AMD Phenom II X4 955 (3.2 GHz at full speed, 800 MHz idle), 64 KB instruction + 64 KB data L1 cache and 512 KB L2 cache per core, shared 6 MB L3 cache. Video: nVidia GeForce 8600 GTS. OS: Windows 7.

xorpd!
KMB_V0.57_2T_X1.exe: Speed [million iterations / second] : 806,589
KMB_V0.57_2T_X2.exe: Speed [million iterations / second] : 1472,073
KMB_V0.57_2T_X3.exe: Speed [million iterations / second] : 851,829
KMB_V0.57_2T_X4.exe: Speed [million iterations / second] : 1901,334
KMB_V0.57_MT_X1.exe: Speed [million iterations / second] : 1535,603
KMB_V0.57_MT_X2.exe: Speed [million iterations / second] : 2779,852
KMB_V0.57_MT_X3.exe: Speed [million iterations / second] : 1621,580
KMB_V0.57_MT_X4.exe: Speed [million iterations / second] : 3478,698

Kuemmel
---------------------------
Kümmel Mandelbrot Benchmark V 0.53I-32b-MT_FPU
---------------------------
Speed [Million Iterations / Second] : 1436.887

Logical CPU cores detected : 4

CPU Brand detected : AMD Phenom(tm) II X4 955 Processor
---------------------------
---------------------------
Kümmel Mandelbrot Benchmark V 0.53I-32b-MT_SSE2
---------------------------
Speed [Million Iterations / Second] : 3818.517

Logical CPU cores detected : 4

CPU Brand detected : AMD Phenom(tm) II X4 955 Processor
---------------------------
---------------------------
Kümmel Mandelbrot Benchmark V 0.53I-32b-MT_SSE4.1
---------------------------
Sorry, your CPU does not support SSE4.1...
---------------------------

Haven't looked into the sources much so I'll ask: do you warm up the cores before starting the measurements?

[edit]tthsqe, I've also tested your MandelbrotPlot, but the ones on page 19 all crash here and the one on page 18 shows me no stats (do I have to do something to make them visible?).
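
To compare the figures above across the two benchmarks, here is a small sketch in Python. It assumes the comma in the xorpd! output is a decimal separator (a locale artifact), so "3478,698" means about 3478.7 million iterations per second. The helper name and dictionary keys are illustrative only.

```python
def parse_speed(text):
    """Parse '3478,698' or '3818.517' (decimal comma or point) as a float."""
    return float(text.strip().replace(",", "."))


speeds = {
    "KMB_V0.57_MT_X4": parse_speed("3478,698"),
    "Kuemmel MT_SSE2": parse_speed("3818.517"),
    "Kuemmel MT_FPU": parse_speed("1436.887"),
}

sse2_over_fpu = speeds["Kuemmel MT_SSE2"] / speeds["Kuemmel MT_FPU"]
x4_over_sse2 = speeds["KMB_V0.57_MT_X4"] / speeds["Kuemmel MT_SSE2"]

print(f"SSE2 vs FPU:      {sse2_over_fpu:.2f}x")
print(f"MT_X4 vs MT_SSE2: {x4_over_sse2:.2f}x")
```

On this machine the SSE2 build comes out well over twice as fast as the FPU build, while MT_X4 lands slightly below MT_SSE2.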
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 08 Dec 2009, 22:49
tthsqe wrote:
Thanks for trying it. Would someone with a core i7 mind trying it with HT on/off? Here is a version with a minor bug fixed (I think the code is 100% correct now).

My i7@3200 MHz / Windows Vista 64 shows the following in the black area:
HT on: around 43.3 GFlops peak
HT off: around 43.1 GFlops peak
...so there may be almost no difference...

@LocoDelAssembly:
Neither Xorpd! nor I use any kind of warm-up code...I guess it's not a big issue, but of course I can't confirm it...I think we had this discussion before...what would be a good 'warm up' ? Your results show the same efficiency as the results I have on my webpage; I just wonder why the normally faster 64-bit MT_X4 version is slower than the 32-bit one...
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4624
Location: Argentina
LocoDelAssembly 08 Dec 2009, 23:31
Quote:

what would be a good 'warm up' ?


I believe something like this per core should be enough:
Code:
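call [GetTickCount]      ; eax = milliseconds since system start
lea  ebx, [eax+1000]     ; ebx = deadline, one second from now

.loop:                   ; busy-wait so the core leaves its idle clock
call [GetTickCount]      ; ebx survives the call (callee-saved in stdcall)
cmp  eax, ebx
jb   .loop               ; keep spinning until the deadline has passed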
And also touch all the memory the core will use once before starting the benchmark to ensure that all runs will have more fairness with respect to TLB caching (I'm guessing here actually...)

I think I've said this already, but I'll say it again: it would be better to perform the writes to regular RAM rather than directly to video memory. That way the benchmark will not be contaminated by the video card's own capabilities (unless you really want to know the CPU performance under these conditions).

If you are discarding the first run or always taking the max score of them rather than the average then just simply ignore all I've said above (except the video card thing).
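
The suggestions in this post (spin for a while, touch the memory once, render into ordinary RAM, keep the best run) can be put together roughly like this. It is a Python sketch of the measurement flow only, not of how either benchmark is implemented. All names are hypothetical, the workload is a stand-in, and the 4096-byte page size is an assumption.

```python
import time


def warm_up(seconds=1.0):
    """Busy-wait so the core leaves its idle clock state before timing starts."""
    deadline = time.perf_counter() + seconds
    spins = 0
    while time.perf_counter() < deadline:
        spins += 1
    return spins


def touch(buffer, page_size=4096):
    """Write one byte per page so every page is mapped before the timed runs."""
    for offset in range(0, len(buffer), page_size):
        buffer[offset] = 0


def best_of(work, runs=5):
    """Time work() several times and keep the fastest run (the max score)."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        work()
        best = min(best, time.perf_counter() - start)
    return best


def render(buffer):
    """Stand-in workload that writes into ordinary RAM, not video memory."""
    for i in range(len(buffer)):
        buffer[i] = (i * 7) & 0xFF


frame = bytearray(64 * 1024)   # off-screen buffer in regular RAM
warm_up(0.05)                  # the post suggests 1 s; shortened for the sketch
touch(frame)
seconds = best_of(lambda: render(frame))
print(f"best run: {seconds * 1e3:.3f} ms")
```

Taking the fastest of several runs already hides most cold-start effects, which is why the warm-up matters mainly when only a single run or an average is reported.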
windwakr



Joined: 30 Jun 2004
Posts: 827
windwakr 08 Dec 2009, 23:53
I just ran version "Kümmel Mandelbrot Benchmark V 0.53I-32b-MT"(latest on the site) and these are my results:
Code:
FPU
Speed: 481.548

SSE2
Speed: 1113.876



Cores: 2
Intel Pentium D 3.4GHZ  (it's a 945)
    

It surprised me how much faster the SSE version is compared to the FPU version.

_________________
----> * <---- My star, won HERE
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 18 Dec 2009, 00:49
Someone wrote a plugin for Photoshop which draws 3D fractals

Something that looks like

Image

or

Image

More info at http://www.subblue.com/blog/2009/12/13/mandelbulb

Project page at http://www.subblue.com/projects/mandelbulb

More pictures at http://www.subblue.com/gallery/album/89 and http://www.flickr.com/photos/subblue/
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 18 Dec 2009, 13:31
Oh, the bottom picture is that broccoli thing mentioned in that über-long thread Kuemmel posted a link to - I ended up skimming it to the end, nice pictures there :)
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 19 Dec 2009, 11:34
Thanks for the reminder for that über-long thread :)

Went there and saw that it's locked, finally, and coincidentally, they locked that thread today.
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 19 Dec 2009, 13:13
...that thread was just closed because it became too big, so everything is now distributed across 4 threads here:
http://www.fractalforums.com/the-3d-mandelbulb/
...at the moment it looks like they hardly use CPU-based code for this 8th-order broccoli Mandelbulb; iteration depth is often only up to 20 or so, so the focus is almost totally on the GPU...I really wonder if it's worth porting to ASM, as it needs lots of trigonometric functions that, to be 'correct', might have to be executed by standard x87 code...anyway the pics there are great, explore them if you've got time!
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 08 Apr 2010, 04:47
Oh, btw, any update on your benchmark?
kalambong



Joined: 08 Nov 2008
Posts: 165
kalambong 18 Apr 2010, 09:40
Just in case you guys are still interested in Mandelbulb, someone made a video of it:

http://www.youtube.com/watch?v=W3x4uJJqs_w

http://vimeo.com/10740680

Enjoy !
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 22 Jan 2011, 23:54
No code update yet, just some results added for the new Intel Sandy Bridge. This might be interesting for some. The gain (compared to the former Intel i7-9xx architecture, for Hyperthreading-based CPUs) is around +18% for the FPU version and around +3% for SSE2/SSE4.

What one can see is that, despite the again twice-as-wide instruction path, the gain isn't very big. I think the instruction units are really busy with the code, with not much computing 'air' left that could be used. The verdict for me is clearly that any heavy floating-point code now needs to switch to AVX to use the full potential of these new CPUs. Hope to find some time this year to give it a try.
Madis731



Joined: 25 Sep 2003
Posts: 2139
Location: Estonia
Madis731 24 Jan 2011, 07:20
Actually the instruction path isn't twice the size AFAIK, and even the SSE datapath is still 128 bits wide. I think what we see here is that SB is clock-for-clock a bit better.
Even if it were true, you wouldn't be able to spot the difference right away because the code needs to use AVX. You need Windows 7 SP1 (still in RC) to be able to use AVX (which I find very disturbing).

Every 256-bit AVX instruction that accesses data does so in 128-bit strides, and while the instructions themselves aren't slower, you cannot have as many instructions running in parallel as with SSE.

http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=99&Itemid=1&limit=1&limitstart=6
Quote:

* 1) The access path of the L2 cache in SB is 128 bit wide and in so far, any application relying on the "mid-level" cache will face the issue that the AVX instructions will have to be split into two parts and then reassembled to the full 256 bit. This means that SB inherently favors legacy SSE instructions over AVX, even if the latter are officially supported.
Update: This information from Intel's software developer forum turned out to be incorrect; the access path is 32 bytes or 256 bits wide. This is consistent with cache bandwidth measurements using Sandra that show 22 bytes transferred per cycle (which would be impossible to achieve with a 16-byte-wide access path).
* 2) The instruction cache with its optimizations for decoded Uops will in its present form dis-favor AVX, simply because of size restrictions.
Update: After reading up a bit more on Intel's Software Forum, this argument does not seem to hold.
* 3) Even if the L2 interconnect path is increased to 256 bit width, only fully aligned vectors can pass in a single cycle whereas everything else needs to be split into an upper and a lower segment.
Update: The problem with this argument is that only full cachelines are transferred anyway, meaning that inherently, there is at least some kind of alignment.


Given all these limitations and restrictions, it is reasonable to assume a 25-30% performance increase in AVX-enabled applications, rather than a doubling in performance as suggested by theoretical benchmarks.
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20301
Location: In your JS exploiting you and your system
revolution 24 Jan 2011, 07:33
Madis731 wrote:
Quote:
Given all these limitations and restrictions, it is reasonable to assume a 25-30% performance increase in AVX-enabled applications, rather than a doubling in performance as suggested by theoretical benchmarks.
It doesn't surprise me that modern CPUs will become more and more memory bound as generations progress. Almost always now in my code I find that making it cache aware and able to improve data reuse while those data are in the cache has the most striking effect upon overall performance.

But, of course, that quote does appear to assume a certain data usage model that may not be evident in some applications. I expect in some fields of application that AVX can give 100% improvement over SSE.
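
One way to put numbers on the gap between 25-30% and 100% is Amdahl's law. This model is swapped in here for illustration; it is not something the quoted article uses. The Python sketch assumes that doubling the vector width doubles the speed of only the part of the run time that is actually compute-bound, and asks how large that part must be for a given overall gain.

```python
def speedup(vector_fraction, vector_gain=2.0):
    """Amdahl's law: only vector_fraction of the run time gets vector_gain faster."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_gain)


def fraction_for(target_speedup, vector_gain=2.0):
    """Invert the formula: share of run time that must benefit to hit the target."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / vector_gain)


for target in (1.25, 1.30, 2.0):
    share = fraction_for(target)
    print(f"{target:.2f}x overall needs {share:.0%} of run time to scale with width")
```

Under that model a 25-30% gain corresponds to a bit under half of the run time scaling with vector width. The full 100% needs essentially all of it, which fits code whose working set stays in registers and cache.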
Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.