flat assembler
Message board for the users of flat assembler.

Index > Windows > Mandelbrot Benchmark FPU/SSE2 released

f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Suspend/resume is the wrong way to do it - event toggling and waiting is the proper way. But sure, this would be nicer, and setting the number of threads dynamically would be nicer, and... etc. :)
Post 28 Dec 2006, 21:55
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel
Xorpd! wrote:
I tried the quickman code on my home PC, Max Iters = 2048, Palette = 3 Muted, Precision = Double, and for Exact, Intel = 710.5; Exact, AMD = 670.0. For comparison, using KMB_V0.53_MT.zip from your website, I got FPU = 337.049, SSE2 = 739.972, SSE3 = 727.863, Vodnaya = 679.789, while KMB_V0.56_MT.ZIP (hope you didn't mind me calling it this; I could rename it if you would like) from my website gets X1 = 787.480, X2 = 1190.608, X3 = 1653.883. So it can be seen that the strategy of creating more instruction streams rather than carefully interweaving only 2 instruction streams is coming out slightly ahead (only slightly, taking into account quickman's lack of threading).


Hi Xorpd ! Nice results !!!

Would you explain in short what kind of modifications you made for X1,X2 and X3 ?

What kind of PC have you got? I bet it's a dual core!? If that's true, I would think the quickman code would reach almost double speed when multithreaded.

Okay, that's still far from your X3 version, but I think yours are all 64-bit? So why not combine all ideas: quickman, instruction streams (is that 64-bit only?) and multi-threading with core locking like f0dder did in my app? Or do instruction streams exclude the idea of quickman?

Sorry, can't look into your code at the moment, just on holiday in an internet cafe ;) But it's nice to see things evolving. I still wonder what a Core 2 Duo would do with your code or quickman. Could anyone test it? Because this chip outperforms all SSE2-capable chips from the past!
Post 29 Dec 2006, 07:45
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
Quote:

Hi Xorpd ! Nice results !!!

Thanks.
Quote:

Would you explain in short what kind of modifications you made for X1,X2 and X3 ?

Well, I saw a few extraneous operations lying about in the inner loop, so I changed that around a little bit. As written,
Code:
.iteration_loop:
        movapd  xmm2, xmm0                                                      ; xmm2:    rz           |   rz + dz
        mulpd   xmm0, xmm0                                                      ; xmm0:    rz^2         |   (rz + dz)^2
        movapd  xmm3, xmm1                                                      ; xmm3:    iz           |   iz
        addpd   xmm1, xmm1                                                      ; xmm1:    iz+iz        |   iz+iz
        mulpd   xmm1, xmm2                                                      ; xmm1:    2*iz*rz      |   2*iz*(rz+dz)
        movapd  xmm2, xmm0                                                      ; xmm2:    rz^2         |   (rz + dz)^2
        mulpd   xmm3, xmm3                                                      ; xmm3:    iz^2         |   iz^2
        
        addpd   xmm1, xmm5                                                      ; xmm1:    2*iz*rz+iz0  |   2*iz*(rz+dz)+iz0
        subpd   xmm0, xmm3                                                      ; xmm0:    rz^2-iz^2    |   (rz + dz)^2-iz^2
        addpd   xmm2, xmm3                                                      ; xmm2:    rz^2+iz^2    |   (rz + dz)^2+iz^2
        
        cmplepd xmm2, xmm7                                                      ; xmm2 <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
        addpd   xmm0, xmm4                                                      ; xmm0:    rz^2-iz^2+rz0|   (rz + dz)^2-iz^2+rz0
        movmskpd        eax, xmm2                                               ; get the sign bits of the two QW in xmm2 in eax -> so either 00,01,10,11

        test    eax, eax
        jz              .end_of_iteration

        andpd   xmm2, xmm7                                                      ; get either 4.0 or 0.0 for each iteration number
        addpd   xmm6, xmm2                                                      ; add        4.0 or 0.0 to the iteration counter
        sub             ecx, 1
        jnz             .iteration_loop
        
.end_of_iteration:
    

there are 7 f.p. adds (cmplepd counts as an addition) and 4 f.p. multiplies (if we count movmskpd as a multiply) in the loop, with a latency of 11 clocks. This underutilizes the processor resources available to us because we could be issuing 11 constructive f.p. additions in these 11 clocks, and not all of the additions are useful.

If we examine your results table, we can see some striking results for the 2.8 GHz Nocona system: the x87 loop is mostly thin air, so the placebo processor has ample resources available to issue its own stream of instructions, and it gets nearly twice the performance with HT enabled. The SSE2 loop is not so spacious for this processor because its f.p. adder and multiplier can each issue one 128-bit wide instruction only every other clock cycle, so it gets less than 50% improvement through HT.

I counted iterations differently so as not to waste an f.p. add doing this; that, plus a little other fussing with the code, gives the *X1.exe versions. However, there was the attractive possibility of using the f.p. multiplier to do the doubling, rather than the adder, but that increases the latency to 13 clocks while reducing the multiplies and additions to 5 each. Thus, to make this a throughput-limited loop instead of a latency-limited loop, at least 3 instruction streams would have to be present so that 3 independent (pairs of) calculations could be carried out in each loop iteration, with a best possible time of 15 clocks if no bubbles worked their way into the pipeline.

I considered that each instruction stream needed at least 2 xmm registers to hold the current values of the pairs of z's and one xmm register to hold the pair of successful iteration counts. Throw in a couple of work registers and we come up with 11 xmm registers required.
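The stream-count reasoning above can be written out as a small calculation. This is only a sketch of the arithmetic in the post, assuming the busiest execution unit can issue one packed operation per clock; the function names are made up for illustration and do not appear in the benchmark.

```c
#include <assert.h>

/* Smallest number of independent instruction streams that keeps one
   execution unit busy: a single stream's dependency chain takes
   latency_clocks, and each stream supplies ops_per_stream operations
   per loop pass. This is ceil(latency_clocks / ops_per_stream). */
int min_streams(int latency_clocks, int ops_per_stream)
{
    assert(latency_clocks > 0 && ops_per_stream > 0);
    return (latency_clocks + ops_per_stream - 1) / ops_per_stream;
}

/* Best-case clocks per loop pass once the loop is throughput-limited:
   one operation issued per clock on the busiest unit, no bubbles. */
int best_case_clocks(int streams, int ops_per_stream)
{
    return streams * ops_per_stream;
}

/* xmm registers needed: per stream, two for z (real and imaginary parts)
   and one for the iteration counts, plus shared work registers. */
int xmm_registers_needed(int streams, int work_registers)
{
    return streams * 3 + work_registers;
}
```

With the figures from the post (latency 13, 5 adds per stream, 2 work registers) this gives 3 streams, 15 clocks and 11 registers, which is more than the 8 xmm registers available in 32-bit mode.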
Now this thought really appealed to me because it highlights the 3 things you really need to achieve good performance: multithreading (to utilize all cores), assembly language (to utilize SIMD opcodes) and x64 (to utilize the superscalar pipelined nature of the processor).

Starting with your pre-existing example for its assembly language and multithreading, the hard part was to convert to x64. For example, the WIN64\MANDEL example that comes with the FASM package doesn't assemble, and the executable that comes with that example doesn't run properly. I suppose I could have attempted to trace through all the *.INC files to try to find the problem, but this was quite intimidating to an FASM newbie, not to mention also being a relative newbie at Win32 API. Accordingly, I elected to translate your example to the most basic FASM possible; I only use one *.INC file in the end. The biggest crisis came when I was trying to figure out how to translate the cominvk macro, but a little disassembly overcame any deficiency in documentation.

The quickman code gets only two instruction streams because of its artificial restriction to 8 xmm registers, but it gets a little boost because it only totals up iteration counts every other trip through the loop. This eliminates two additions; recall that one addition can be exchanged for a multiplication if this proves advantageous. To enable this slight optimization in my code, I would have to create a fourth instruction stream to get the throughput/latency ratio above unity, and probably have to unroll the code by four to get the maximum effect. That would involve eight loop exit points, each of which needs its own makeup code to determine in which of the last four iterations it diverged. Not to mention the extra code needed to handle what happens when you get to the end of a scan line. You don't think I would really attempt that, do you?
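To make the idea of several instruction streams concrete, here is a minimal scalar C sketch. It is not Xorpd!'s code: plain doubles stand in for the packed SSE2 registers, the integer counter is only one possible way of "counting iterations differently", and the name mandel_streams is hypothetical. The point is that the three points share no data, so their dependency chains can overlap in the pipeline.

```c
#include <assert.h>

#define STREAMS 3   /* three independent points per pass, as in the post */

/* Iterate STREAMS independent points c[k] = cr[k] + i*ci[k] in one loop
   and store the number of iterations each one survived in count[k]. */
void mandel_streams(const double cr[STREAMS], const double ci[STREAMS],
                    int max_iter, int count[STREAMS])
{
    double zr[STREAMS] = {0.0};
    double zi[STREAMS] = {0.0};
    int active[STREAMS];

    assert(max_iter >= 0);
    for (int k = 0; k < STREAMS; k++) {
        count[k] = 0;
        active[k] = 1;
    }

    for (int pass = 0; pass < max_iter; pass++) {
        int still_running = 0;
        for (int k = 0; k < STREAMS; k++) {
            if (!active[k])
                continue;
            double r2 = zr[k] * zr[k];
            double i2 = zi[k] * zi[k];
            if (r2 + i2 > 4.0) {        /* |z|^2 > 4: this stream diverged */
                active[k] = 0;
                continue;
            }
            double ri2 = 2.0 * zr[k] * zi[k];  /* doubling done by a multiply */
            zr[k] = r2 - i2 + cr[k];
            zi[k] = ri2 + ci[k];
            count[k]++;                 /* integer counter, no f.p. add */
            still_running = 1;
        }
        if (!still_running)
            break;                      /* every stream has diverged */
    }
}
```

In an assembly version each stream would occupy its own set of xmm registers, which is where the register count discussed above comes from.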
Post 02 Jan 2007, 04:58
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Xorpd: please do a little paragraph formatting on your posts - you have interesting content, but it's almost impossible to read without pasting to notepad and manually formatting :)
Post 02 Jan 2007, 08:28
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
I have noted your delicate sense of irony in, e.g., recommending a Core 2 Duo in a thread started by a complaint that his P4 was only counting every other clock. Just imagine how much fun he would have with an X6800...

Of course you continue that irony here by complaining about my apparently non-functional paragraph formatting rather than pointing out how to create functioning formatting codes on this forum. If a moderator doesn't come around and fix it, you will just have to switch to portrait mode and hit the zoom bar a couple of times.

BTW, have you noticed that neither the original benchmark nor my modifications function properly in portrait mode? How do you suppose that could be fixed?
Post 02 Jan 2007, 09:12
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
Just hit enter a couple of times - that works. No reason to be hostile.
Post 02 Jan 2007, 09:32
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel
Hi Xorpd!, thanx for all the explanations !

I didn't know much about 64-bit stuff. The doubling of the number of registers is a huge benefit; as I used to code for ARM processors, where I already had 16 regs (general purpose, though), I know the benefit of having lots of them quite well. I saw in the quickman code that he had to use some memory accesses to manage the two instruction flows...

It's just a pity that I, and maybe most users, still use a 32-bit OS... I still wonder how much of a boost your code would give, compared to my stuff and quickman's, if it were limited to 8 regs and used memory accesses for the missing ones.

I'm curious to try a 64-bit OS, so are there recommendations on which OS to use (is Vista already compatible with your code?), or are there reasonable emulators for 64-bit code on 32-bit?

Are there any users out there with a 64-bit OS and a Core 2 Duo, so that we could compile some results for Xorpd!'s latest code and compare them to other CPUs?
Post 08 Jan 2007, 02:44
MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany
MCD
Maybe some of you, or just me, got off-topic, but I remember making a simple console Mandelbrot benchmark for Windows that uses RDTSC and the system timer, with SSE1 and 3DNow! instructions.

Funnily, it always appears to me that the 3DNow! version is faster than the SSE1 one, but you will need an AMD CPU for that :(
I had no time/nerve to implement CPUID detection for SSE/3DNow!, so if your CPU doesn't have both, the program WILL crash! (someone should add a CPUID/feature detection routine to FASMLIB :) )
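As a rough illustration of what such a detection routine has to check, the sketch below decodes the relevant CPUID feature bits in C. The bit positions are the documented ones (SSE is EDX bit 25 and SSE2 is EDX bit 26 of CPUID function 1; 3DNow! is EDX bit 31 of function 80000001h). The function and flag names are hypothetical, and actually executing CPUID to obtain the register values is left out so that the example stays portable.

```c
#include <stdint.h>

#define FEAT_SSE    (1u << 0)
#define FEAT_SSE2   (1u << 1)
#define FEAT_3DNOW  (1u << 2)

/* Decode feature flags from raw CPUID register values.
   std_edx: EDX after CPUID with EAX = 1
   ext_edx: EDX after CPUID with EAX = 80000001h; pass 0 if the CPU does
            not report that extended function (check EAX = 80000000h first) */
unsigned decode_features(uint32_t std_edx, uint32_t ext_edx)
{
    unsigned features = 0;

    if (std_edx & (UINT32_C(1) << 25))
        features |= FEAT_SSE;
    if (std_edx & (UINT32_C(1) << 26))
        features |= FEAT_SSE2;
    if (ext_edx & (UINT32_C(1) << 31))
        features |= FEAT_3DNOW;
    return features;
}
```

A benchmark could test the returned flags before entering the SSE or 3DNow! loop and print a message instead of crashing when a flag is missing.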

But unfortunately I don't do stuff in Windows anymore, so this program is more or less unmaintained:


Attachment: MANDEL3D.ASM (11.82 KB, downloaded 110 times)


_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 08 Jan 2007, 21:57
vid
Verbosity in development


Joined: 05 Sep 2003
Posts: 7105
Location: Slovakia
vid
Quote:
someone should add a CPUID/feature detection routine to FASMLIB

agree :D
Post 08 Jan 2007, 22:13
f0dder



Joined: 19 Feb 2004
Posts: 3170
Location: Denmark
f0dder
MCD wrote:

Funnily, it always appears to me that the 3DNow! version is faster than the SSE1 one, but you will need an AMD CPU for that

AMD CPUs seem to have a relatively poor SSE implementation; what you need to do is look at iterations/MHz with the various methods and compare to other CPUs.
Post 08 Jan 2007, 23:43
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel
MCD wrote:
maybe some of you, or just me got off-topic, but I remember making a simple console Mandelbrot-benchmark that uses RDTSC and the system timer for Windows with SSE1 and 3DNow! instructions.

Hi MCD, yeah, I remember lots of implementations in 3DNow!; as far as I know 3DNow! was even there before SSE1... I just think it's the single precision that makes it not that usable for the Mandelbrot stuff (same as SSE1). If you go very 'deep' into the Mandelbrot set, even double precision with SSE2 wouldn't do it anymore, I think; then only integer fixed-point math can do the job... which means any kind of FPU/SSE unit is kind of useless... or there's some weird math out there I don't know of that makes it usable... I would think any heavy iteration algorithm needs lots of precision.

Maybe that's also off-topic, but I think the general purpose registers are also doubled on a 64-bit OS? That would help a fixed-point math Mandelbrot algorithm quite a lot, too!
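For readers unfamiliar with the fixed-point approach, here is a minimal C sketch of the Mandelbrot iteration on integers. It assumes a single-word Q4.28 format (4 integer bits, 28 fraction bits), chosen only to keep the example short; that is less precise than a double, and a deep-zoom version would chain several machine words per number, which is where wider 64-bit registers and their 64x64-bit multiply would pay off. The name mandel_fixed is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 28                          /* Q4.28 fixed point */
#define FIX_ONE   ((int64_t)1 << FRAC_BITS)   /* the value 1.0 */

/* Count iterations of z = z^2 + c for c = cr + i*ci given in Q4.28.
   Products are formed in 64-bit integers and scaled back by division,
   which is well defined for negative values (unlike a right shift). */
int mandel_fixed(int32_t cr, int32_t ci, int max_iter)
{
    int64_t zr = 0, zi = 0;
    int i;

    assert(max_iter >= 0);
    /* keep c within [-2, 2] so that no intermediate product overflows */
    assert(cr >= -2 * FIX_ONE && cr <= 2 * FIX_ONE);
    assert(ci >= -2 * FIX_ONE && ci <= 2 * FIX_ONE);

    for (i = 0; i < max_iter; i++) {
        int64_t zr2  = (zr * zr) / FIX_ONE;
        int64_t zi2  = (zi * zi) / FIX_ONE;
        int64_t zri2 = (zr * zi) / (FIX_ONE / 2);   /* 2 * zr * zi */

        if (zr2 + zi2 > 4 * FIX_ONE)
            break;                                  /* |z|^2 > 4: escaped */
        zr = zr2 - zi2 + cr;
        zi = zri2 + ci;
    }
    return i;
}
```

A point is passed in by scaling it with FIX_ONE, so c = 1 + 0i becomes mandel_fixed(1 << 28, 0, max_iter).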
Post 11 Jan 2007, 10:25
MCD



Joined: 21 Aug 2004
Posts: 604
Location: Germany
MCD
Kuemmel wrote:
I just think it's the single precision that makes it not that usable for the Mandelbrot stuff (same as SSE1). If you go very 'deep' into the Mandelbrot set, even double precision with SSE2 wouldn't do it anymore, I think; then only integer fixed-point math can do the job... which means any kind of FPU/SSE unit is kind of useless

You usually don't need to zoom very deep into the Mandelbrot set, since all patterns in it will repeat after a while with decreasing differences, so this doesn't make sense unless you want to benchmark some fixed-point math library or whatever.

_________________
MCD - the inevitable return of the Mad Computer Doggy

-||__/
.|+-~
.|| ||
Post 14 Jan 2007, 04:19
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
I finally got someone to try my modification of this thread's benchmark. Therefore I started a table of results. I am still curious about how this stuff performs on AMD processors, and about whether hyperthreading is completely negated in the high instruction stream count (i.e. *X4*) versions.

It's still surprisingly hard to find a 64-bit computer to run these on. When I go to the store, the people working there are for the most part unaware that 32-bit versions of Windows have the deficiencies that they do compared to 64-bit versions and are surprised that 64-bit programs simply won't run on any of their machines.

This benchmark is intended to show some of these deficiencies, but it's kind of hard to achieve this given that I have such a small sampling of processors in my table. The quickman program negates these deficiencies to a certain extent by implementing a second instruction stream, and I am a bit surprised that the original authors of the 32-bit version of the current benchmark haven't followed suit and included a second instruction stream as well as the other two optimizations that are in force there.
Post 23 Oct 2007, 20:50
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Quote:
Page URL Not Found!!


The requested page does not exist on this server. The URL you typed or followed is either outdated or inaccurate.


I don't have a 64-bit Windows so I can't test. You should consider making a Linux version since it is a lot more available; I even tried installing the 64-bit kernel image on an i386 Debian and it worked without any trouble (but I have to check how to make the nvidia drivers work; so far I'm using nv, which is open source but suboptimal).

[edit]BTW, where is your version?? I searched but it seems that I'm too blind right now to see it. Or am I wrong about a 64-bit version? I would swear that you posted a 64-bit Mandelbrot and that I even downloaded it once just to look inside :S[/edit]
Post 23 Oct 2007, 21:52
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
The link works now, thanks for fixing it (in case it wasn't a temporary problem :P)

PS: And the program is hosted on your page, so I must have downloaded it from there.
Post 24 Oct 2007, 00:00
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
No fixing, so the problem getting to my web page must have just been one of those internet things.

It's impossible for me to write a Linux program, so if you want one you will have to do the same thing I did when I wanted a 64-bit Windows version: translate. You could run the 64-bit Windows version under WINE or however you run Windows programs in Linux.
Post 24 Oct 2007, 03:54
LocoDelAssembly
Your code has a bug


Joined: 06 May 2005
Posts: 4633
Location: Argentina
LocoDelAssembly
Unfortunately WINE does not support 64-bit apps yet. I hope to have time this weekend to port it to Linux, but it would be a "blind" version since I know very little about X Window. Well, I think a blind version is better since we want to benchmark the CPUs, not the CPUs plus the speed of the video card, both hardware and driver.
Post 24 Oct 2007, 04:17
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel
Xorpd! wrote:
This benchmark is intended to show some of these deficiencies, but it's kind of hard to achieve this given that I have such a small sampling of processors in my table. The quickman program negates these deficiencies to a certain extent by implementing a second instruction stream, and I am a bit surprised that the original authors of the 32-bit version of the current benchmark haven't followed suit and included a second instruction stream as well as the other two optimizations that are in force there.


Yeah, well, I had a talk with the quickman author about why he didn't implement a multi-CPU version of his code. He said one problem is lack of time, and he thinks that, due to the lack of registers and some optimizations he made that couldn't be used then, there would be performance problems... so I was also not too keen to invest much time in it... but of course that's exactly what we would need to make a more reasonable comparison to the x64 code: the combination of a second instruction stream and the other stuff for x32.

It would also be interesting to see how your code performs on an HT-enabled machine. There should be some huge differences, as you can see at least in the results table on my webpage; HT also helps the FPU version a lot, because clock for clock these old P4s are quite shitty without HT.

I'm really curious about the next implementation in the far future, where Intel wants to combine Core 2 Duo with HT.

EDIT: I'll post a link to your webpage in another forum to get some results. I already got some from them with big 'machines', and hopefully they have an x64 OS.
Post 25 Oct 2007, 09:06
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel
Got some results on Vista x64 from another forum for Xorpd!'s versions, quite interesting:

http://forums.2cpu.com/showthread.php?t=76178&page=4

The scaling from both seems very different. Hope to get a Barcelona result in the next weeks...and I will try to do a nice results table some day.

Only there seems to be a problem with that 8-core Intel 5310 machine... the results are too low compared to the 2-core CPU tests if one computes the speed per core... any idea? I already asked the guy to check his setup...
Post 29 Oct 2007, 15:16
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd!
Thanks for getting those results, Kümmel. I updated the table on my web page accordingly. The FX-70 results are about the same in terms of iterations per clock cycle per core as the Pentium D, which is about what you would expect since both processors can sustain no more than 2 double precision flops per clock cycle.

I agree that the Xeon results look suspect -- I would guess that he doesn't have one of the cores enabled on the x64 tests. It seems this is a common problem. If you could get DaveB to fix that somehow and run all 10 or 11 benchmarks again, his system should end up seriously smoking everything we have seen to date...
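The per-clock, per-core comparison used in this post is a simple normalization, sketched below in C. The function name is hypothetical, and it assumes the benchmark score scales linearly with both clock rate and core count, which is an idealization.

```c
#include <assert.h>

/* Benchmark score divided by clock rate and core count, so that CPUs
   with different clocks and core counts can be compared directly. */
double score_per_mhz_per_core(double score, double clock_mhz, int cores)
{
    assert(clock_mhz > 0.0 && cores > 0);
    return score / (clock_mhz * cores);
}
```

For example, with made-up numbers, a score of 1024 on a 2-core machine at 2048 MHz normalizes to 0.25. A multi-core result whose normalized value falls far below that of a 2-core CPU of the same architecture would be consistent with some cores sitting idle during the test.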
Post 04 Nov 2007, 09:12
Copyright © 1999-2020, Tomasz Grysztar. Also on GitHub, YouTube, Twitter.
