flat assembler
Message board for the users of flat assembler.
Kuemmel 29 Dec 2006, 07:45
Xorpd! wrote: I tried the quickman code on my home PC, Max Iters = 2048, Palette = 3 Muted, Precision = Double, and for Exact, Intel = 710.5; Exact, AMD = 670.0. For comparison, using KMB_V0.53_MT.zip from your website, I got FPU = 337.049, SSE2 = 739.972, SSE3 = 727.863, Vodnaya = 679.789, while KMB_V0.56_MT.ZIP (hope you didn't mind me calling it this; I could rename it if you would like) from my website gets X1 = 787.480, X2 = 1190.608, X3 = 1653.883. So it can be seen that the strategy of creating more instruction streams rather than carefully interweaving only 2 instruction streams is coming out slightly ahead (only slightly, taking into account quickman's lack of threading).

Hi Xorpd! Nice results!!! Would you explain in short what kind of modifications you made for X1, X2 and X3? What kind of PC have you got? I bet it's a dual core!? If that's true, I would think the quickman code would run at almost double speed when multithreaded. Okay, that's still far from your X3 version, but I think yours are all 64-bit? So why not combine all the ideas: quickman, instruction streams (is that 64-bit only?) and multithreading with core locking like f0dder did in my app? Or do instruction streams exclude the idea of quickman? Sorry, I can't look into your code at the moment, I'm just on holiday in an internet cafe.
Xorpd! 02 Jan 2007, 04:58
Quote:
Thanks. Quote:
Well, I saw a few extraneous operations lying about in the inner loop, so I changed that around a little bit. As written,

Code:

    .iteration_loop:
        movapd   xmm2, xmm0      ; xmm2: rz | rz+dz
        mulpd    xmm0, xmm0      ; xmm0: rz^2 | (rz+dz)^2
        movapd   xmm3, xmm1      ; xmm3: iz | iz
        addpd    xmm1, xmm1      ; xmm1: iz+iz | iz+iz
        mulpd    xmm1, xmm2      ; xmm1: 2*iz*rz | 2*iz*(rz+dz)
        movapd   xmm2, xmm0      ; xmm2: rz^2 | (rz+dz)^2
        mulpd    xmm3, xmm3      ; xmm3: iz^2 | iz^2
        addpd    xmm1, xmm5      ; xmm1: 2*iz*rz+iz0 | 2*iz*(rz+dz)+iz0
        subpd    xmm0, xmm3      ; xmm0: rz^2-iz^2 | (rz+dz)^2-iz^2
        addpd    xmm2, xmm3      ; xmm2: rz^2+iz^2 | (rz+dz)^2+iz^2
        cmplepd  xmm2, xmm7      ; xmm2 <= 4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
        addpd    xmm0, xmm4      ; xmm0: rz^2-iz^2+rz0 | (rz+dz)^2-iz^2+rz0
        movmskpd eax, xmm2       ; get the sign bits of the two QW in xmm2 in eax -> so either 00,01,10,11
        test     eax, eax
        jz       .end_of_iteration
        andpd    xmm2, xmm7      ; get either 4.0 or 0.0 for each iteration number
        addpd    xmm6, xmm2      ; add 4.0 or 0.0 to the iteration counter
        sub      ecx, 1
        jnz      .iteration_loop
    .end_of_iteration:

there are 7 f.p. adds (cmplepd counts as an addition) and 4 f.p. multiplies (if we count movmskpd as a multiply) in the loop, with a latency of 11 clocks. This underutilizes the processor resources available to us, because we could be issuing 11 constructive f.p. additions in these 11 clocks, and not all of the additions are useful.

If we examine your results table, we can see some striking results for the 2.8 GHz Nocona system: the x87 loop is mostly thin air, so the placebo processor has ample resources available to issue its own stream of instructions, and it gets nearly twice the performance with HT enabled. The SSE2 loop is not so spacious for this processor, because its f.p. adder and multiplier can each issue one 128-bit wide instruction only every other clock cycle, so it gets less than 50% improvement through HT.

I counted iterations differently so as not to waste an f.p. add doing this; that, plus a little other fussing with the code, is what the *X1.exe versions are. However, there was the attractive possibility of using the f.p. multiplier to do the doubling, rather than the adder, but that increases the latency to 13 clocks while reducing the multiplies and additions to 5 each. Thus, to make this a throughput-limited loop instead of a latency-limited loop, at least 3 instruction streams would have to be present, so that 3 independent (pairs of) calculations could be carried out in each loop iteration, with a best possible time of 15 clocks if no bubbles worked their way into the pipeline.

I considered that each instruction stream needed at least 2 xmm registers to hold the current values of the pairs of z's and one xmm register to hold the pair of successful iteration counts. Throw in a couple of work registers and we come up with 11 xmm registers required. Now this thought really appealed to me because it highlights the 3 things you really need to achieve good performance: multithreading (to utilize all cores), assembly language (to utilize SIMD opcodes) and x64 (to utilize the superscalar pipelined nature of the processor).

Starting with your pre-existing example for its assembly language and multithreading, the hard part was to convert to x64. For example, the WIN64\MANDEL example that comes with the FASM package doesn't assemble, and the executable that comes with that example doesn't run properly. I suppose I could have attempted to trace through all the *.INC files to try to find the problem, but this was quite intimidating to an FASM newbie, not to mention also being a relative newbie at the Win32 API. Accordingly, I elected to translate your example to the most basic FASM possible; I only use one *.INC file in the end. The biggest crisis came when I was trying to figure out how to translate the cominvk macro, but a little disassembly overcame any deficiency in documentation.
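
The stream and register arithmetic in this post can be checked with a few lines of Python. This is only a back-of-the-envelope sketch, not anything taken from the benchmark itself: the function names are made up for illustration, and the model assumes the busiest unit (adder or multiplier) can issue one packed operation per clock, which the post implies for the best case but which does not hold for every processor discussed in the thread.

```python
import math

def min_streams(latency_clocks, ops_per_stream):
    """Smallest number of independent instruction streams whose combined
    work on the busiest unit is at least as long as the dependency chain
    (assumption: that unit issues one operation per clock)."""
    return math.ceil(latency_clocks / ops_per_stream)

def best_case_clocks(latency_clocks, ops_per_stream, streams):
    """Lower bound for one trip through the loop: bounded either by the
    dependency chain (latency) or by the issue slots on the busiest unit."""
    return max(latency_clocks, ops_per_stream * streams)

def xmm_registers_needed(streams, per_stream=3, work=2):
    """Per stream: 2 registers for the z values plus 1 for the iteration
    counts; 'work' scratch registers are shared by all streams."""
    return streams * per_stream + work

def max_streams(available_xmm, per_stream=3, work=2):
    """Most streams that fit in a register file under the same assumptions."""
    return (available_xmm - work) // per_stream
```

With the figures from the post (latency 13 clocks, 5 additions and 5 multiplies per stream), `min_streams(13, 5)` gives 3 and `best_case_clocks(13, 5, 3)` gives 15, while two streams stay latency-limited at 13 clocks. `xmm_registers_needed(3)` gives 11, which is more than the 8 xmm registers of 32-bit mode but fits in the 16 available in x64.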
The quickman code gets only two instruction streams because of its artificial restriction to 8 xmm registers, but it gets a little boost because it only totals up iteration counts every other trip through the loop. This eliminates two additions; recall that one addition can be exchanged for a multiplication if this proves advantageous. To enable this slight optimization in my code, I would have to create a fourth instruction stream to get the throughput/latency ratio above unity, and probably have to unroll the code by four to get the maximum effect. That would involve eight loop exit points, each of which needs its own makeup code to determine in which of the last four iterations it diverged. Not to mention the extra code needed to handle what happens when you get to the end of a scan line. You don't think I would really attempt that, do you?
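
To make the "count every other trip, then fix up at the exit" idea concrete, here is a scalar Python sketch. It is a simplified model written for this illustration, not quickman's code and not the X1/X2/X3 code: it handles one point instead of SIMD pairs, unrolls by two rather than four, and the function names are hypothetical. It tests for bailout after each step with |z|^2 > 4.0.

```python
def step(zr, zi, cr, ci):
    """One Mandelbrot iteration z -> z*z + c on real/imaginary parts."""
    return zr * zr - zi * zi + cr, 2.0 * zr * zi + ci

def escaped(zr, zi):
    """Bailout test: |z|^2 > 4.0."""
    return zr * zr + zi * zi > 4.0

def count_every_trip(cr, ci, max_iter):
    """Reference version: the counter is updated after every iteration."""
    zr = zi = 0.0
    n = 0
    while n < max_iter:
        zr, zi = step(zr, zi, cr, ci)
        if escaped(zr, zi):
            break
        n += 1
    return n

def count_every_other_trip(cr, ci, max_iter):
    """Unrolled by two: the counter is touched once per pair of iterations,
    and each exit point carries its own makeup code."""
    zr = zi = 0.0
    n = 0
    while n + 2 <= max_iter:
        zr, zi = step(zr, zi, cr, ci)
        if escaped(zr, zi):
            return n          # exit 1: nothing in this pair succeeded
        zr, zi = step(zr, zi, cr, ci)
        if escaped(zr, zi):
            return n + 1      # exit 2: makeup code credits the first half
        n += 2                # the only counter update in the steady state
    if n < max_iter:          # leftover iteration when max_iter is odd
        zr, zi = step(zr, zi, cr, ci)
        if not escaped(zr, zi):
            n += 1
    return n
```

Both functions return the same count for any point; the second one simply moves the bookkeeping out of the steady-state path and into the exits. Unrolling by four with several SIMD streams multiplies the number of exit points in the same way, which is the effort the post is describing.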
f0dder 02 Jan 2007, 08:28
Xorpd: please do a little paragraph formatting on your posts - you have interesting content, but it's almost impossible to read without pasting to notepad and manually formatting
Xorpd! 02 Jan 2007, 09:12
I have noted your delicate sense of irony in, e.g., recommending a Core 2 Duo in a thread started by a complaint that his P4 was only counting every other clock. Just imagine how much fun he would have with an X6800...

Of course you continue that irony here by complaining about my apparently non-functional paragraph formatting rather than pointing out how to create functioning formatting codes on this forum. If a moderator doesn't come around and fix it, you will just have to switch to portrait mode and hit the zoom bar a couple of times.

BTW, have you noticed that neither the original benchmark nor my modifications function properly in portrait mode? How do you suppose that could be fixed?
f0dder 02 Jan 2007, 09:32
Just hit enter a couple of times - that works. No reason to be hostile.
Kuemmel 08 Jan 2007, 02:44
Hi Xorpd!, thanks for all the explanations!
I didn't know much about 64-bit stuff; the doubling of the number of registers is a huge benefit. As I used to code for ARM processors, where I already had 16 registers (general purpose, though), I know the benefit of having lots of them quite well. I saw in the quickman code that he had to use some memory accesses to manage the two instruction flows... It's just a pity that I, and maybe most users, still use a 32-bit OS... I still wonder how big the boost would be if your code were limited to 8 registers and used memory accesses for the missing ones, compared to my stuff and quickman's. I'm curious to try a 64-bit OS, so are there recommendations on which OS to use (is Vista already compatible with your code?), or are there reasonable emulators for 64-bit code on 32-bit? Are there any users out there with a 64-bit OS and a Core 2 Duo, so that we could compile some results for Xorpd!'s latest code and compare them to other CPUs?
MCD 08 Jan 2007, 21:57
Maybe some of you, or just me, got off-topic, but I remember making a simple console Mandelbrot benchmark for Windows that uses RDTSC and the system timer, with SSE1 and 3DNow! instructions.
Funnily, it always appears to me that the 3DNow! version is faster than the SSE1 one, but you will need an AMD CPU for that. I had no time/nerve to implement CPUID detection for both SSE/3DNow!, so if your CPU doesn't have them, the program WILL crash! (Someone should add a CPUID/feature detection routine to FASMLIB.) But unfortunately I don't do stuff in Windows anymore, so this program is more or less unmaintained:
_________________
MCD - the inevitable return of the Mad Computer Doggy
vid 08 Jan 2007, 22:13
Quote: someone should add a CPUID/feature detection routine in FASMLIB
agree
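
As a rough illustration of what such a feature-detection routine has to check, the sketch below decodes the relevant CPUID feature bits in Python. Python cannot execute CPUID itself, so the register values are assumed to be supplied by whatever issued the instruction; `decode_features` and `can_run` are hypothetical names for this example and are not part of FASMLIB. The bit positions are the documented ones: leaf 1 EDX bit 23 (MMX), bit 25 (SSE), bit 26 (SSE2), leaf 1 ECX bit 0 (SSE3), and leaf 80000001h EDX bit 31 (3DNow!) and bit 30 (extended 3DNow!).

```python
def decode_features(leaf1_edx, leaf1_ecx, ext_edx=0):
    """Decode CPUID feature flags from raw register values.

    leaf1_edx, leaf1_ecx: EDX/ECX returned by CPUID with EAX = 1.
    ext_edx: EDX returned by CPUID with EAX = 80000001h; pass 0 when the
             CPU reports a maximum extended leaf below 80000001h.
    """
    def bit(value, n):
        return bool((value >> n) & 1)

    return {
        "mmx": bit(leaf1_edx, 23),
        "sse": bit(leaf1_edx, 25),
        "sse2": bit(leaf1_edx, 26),
        "sse3": bit(leaf1_ecx, 0),
        "3dnow": bit(ext_edx, 31),
        "3dnowext": bit(ext_edx, 30),
    }

def can_run(features, needed):
    """True only if every required instruction set is reported present."""
    return all(features[name] for name in needed)
```

A benchmark with several code paths would call something like `can_run(features, ["sse"])` before selecting the SSE1 path and `can_run(features, ["3dnow"])` before the 3DNow! path, and skip a path instead of crashing when the check fails.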
f0dder 08 Jan 2007, 23:43
MCD wrote:
AMD CPUs seem to have a relatively poor SSE implementation; what you need to do is look at iterations/MHz with the various methods and compare to other CPUs.
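
The normalization suggested here is a one-line calculation; a minimal sketch follows. The function name is hypothetical, and any clock speed or core count passed to it has to come from the tested machine, since the thread does not record them alongside the scores.

```python
def score_per_mhz(score, clock_mhz, cores=1):
    """Benchmark score normalized by clock speed and, optionally, core count,
    so that CPUs running at different frequencies can be compared."""
    if clock_mhz <= 0 or cores <= 0:
        raise ValueError("clock_mhz and cores must be positive")
    return score / (clock_mhz * cores)
```

For example, a made-up score of 1200.0 on a hypothetical 2400 MHz dual core gives `score_per_mhz(1200.0, 2400.0, 2) == 0.25` per MHz per core. The same per-core normalization is what exposes a result that is too low for the number of cores a machine claims to have.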
Kuemmel 11 Jan 2007, 10:25
MCD wrote: maybe some of you, or just me got off-topic, but I remember making a simple console Mandelbrot-benchmark that uses RDTSC and the system timer for Windows with SSE1 and 3DNow! instructions.

Hi MCD, yeah, I remember lots of implementations in 3DNow!; as far as I know, 3DNow! was even there before SSE1... I just think it's the single precision that makes it not that usable for Mandelbrot stuff (same as SSE1). If you go very 'deep' into the Mandelbrot set, even double precision with SSE2 wouldn't do it anymore, I think; then only integer fixed-point math can do the job... which means any kind of FPU/SSE unit is kind of useless... or there's some weird math out there I don't know of that makes it usable... I would think any heavy iteration algorithm needs lots of precision. Maybe that's also off-topic, but I think the general purpose registers are also doubled on a 64-bit OS? That would help a fixed-point Mandelbrot algorithm quite a lot, too!
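
A small Python sketch of the fixed-point idea mentioned above: coordinates are stored as integers scaled by 2^FRAC_BITS, and every multiply is followed by a shift. Python's arbitrary-size integers stand in for the multi-word arithmetic an assembly version would have to build from general purpose registers. The names `to_fixed`, `fixed_mul` and `mandel_fixed` and the choice of 96 fractional bits are assumptions made for this example only.

```python
FRAC_BITS = 96  # fractional bits; an arbitrary choice for this sketch

def to_fixed(x, frac_bits=FRAC_BITS):
    """Convert a float to a fixed-point integer (value * 2**frac_bits)."""
    return int(round(x * (1 << frac_bits)))

def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point numbers; the shift drops the extra
    fractional bits (rounding toward minus infinity)."""
    return (a * b) >> frac_bits

def mandel_fixed(cr, ci, max_iter, frac_bits=FRAC_BITS):
    """Count iterations of z -> z*z + c that stay within |z|^2 <= 4,
    with c given as fixed-point integers."""
    four = 4 << frac_bits
    zr = zi = 0
    n = 0
    while n < max_iter:
        new_zr = fixed_mul(zr, zr, frac_bits) - fixed_mul(zi, zi, frac_bits) + cr
        new_zi = 2 * fixed_mul(zr, zi, frac_bits) + ci
        zr, zi = new_zr, new_zi
        if fixed_mul(zr, zr, frac_bits) + fixed_mul(zi, zi, frac_bits) > four:
            break
        n += 1
    return n
```

The precision difference is easy to see: in double precision `-0.75 + 2.0**-80 == -0.75` is true, so two pixels that far apart collapse into one, whereas with 96 fractional bits the same offset is the integer 65536 and the two coordinates stay distinct.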
MCD 14 Jan 2007, 04:19
Kuemmel wrote: just I think it's the single precision that makes it not that usable for the Mandelbrot-stuff (same like SSE1), if you go very 'deep' into the Mandelbrot set, even double precision with SSE2 wouldn't do it anymore, I think, then only integer fixed point math can do the job...what means any kind of FPU/SSE unit is kind of useless

You usually don't need to zoom very deep into the Mandelbrot set, since all patterns in it will repeat after a while with decreasing differences, so this doesn't make sense unless you want to benchmark some fixed-point math library or whatever.
_________________
MCD - the inevitable return of the Mad Computer Doggy
Xorpd! 23 Oct 2007, 20:50
I finally got someone to try my modification of this thread's benchmark. Therefore I started a table of results. I am still curious about how this stuff performs on AMD processors, and about whether hyperthreading is completely negated in the high instruction stream count (i.e. *X4*) versions.
It's still surprisingly hard to find a 64-bit computer to run these on. When I go to the store, the people working there are for the most part unaware that 32-bit versions of Windows have the deficiencies that they do compared to 64-bit versions and are surprised that 64-bit programs simply won't run on any of their machines. This benchmark is intended to show some of these deficiencies, but it's kind of hard to achieve this given that I have such a small sampling of processors in my table. The quickman program negates these deficiencies to a certain extent by implementing a second instruction stream, and I am a bit surprised that the original authors of the 32-bit version of the current benchmark haven't followed suit and included a second instruction stream as well as the other two optimizations that are in force there.
LocoDelAssembly 23 Oct 2007, 21:52
Quote: Page URL Not Found!!
I don't have a 64-bit Windows, so I can't test. You should consider making a Linux version, since it is a lot more available; I even tried installing the 64-bit kernel image on an i386 Debian and it worked without any trouble (but I have to check how to make the nvidia drivers work; so far I'm using nv, which is open source but suboptimal).
[edit]BTW, where is your version?? I searched, but it seems that I'm too blind right now to see it. Or am I wrong about a 64-bit version? I would swear that you posted a 64-bit Mandelbrot and that I even downloaded it once just to look inside :S[/edit]
LocoDelAssembly 24 Oct 2007, 00:00
The link works now, thanks for fixing it (in case it wasn't a temporary problem).
PS: And the program is hosted on your page, so I must have downloaded it from there.
Xorpd! 24 Oct 2007, 03:54
No fixing, so the problem getting to my web page must have just been one of those internet things.
It's impossible for me to write a Linux program, so if you want one you will have to do the same thing I did when I wanted a 64-bit Windows version: translate. You could run the 64-bit Windows version under WINE or however you run Windows programs in Linux.
LocoDelAssembly 24 Oct 2007, 04:17
Unfortunately, WINE does not support 64-bit apps yet. I hope to have time this weekend to port it to Linux, but it would be a "blind" version, since I know very little about X Window. Well, I think a blind version is better anyway, since we want to benchmark the CPUs, not the CPUs plus the speed of the video card, both hardware and driver.
Kuemmel 25 Oct 2007, 09:06
Xorpd! wrote: This benchmark is intended to show some of these deficiencies, but it's kind of hard to achieve this given that I have such a small sampling of processors in my table. The quickman program negates these deficiencies to a certain extent by implementing a second instruction stream, and I am a bit surprised that the original authors of the 32-bit version of the current benchmark haven't followed suit and included a second instruction stream as well as the other two optimizations that are in force there.

Yeah, well, I talked to the quickman author about why he didn't implement a multi-CPU version of his code. He said one problem is a lack of time, and he thinks that, due to the lack of registers and because some of the other optimizations he did can't be used then, there would be performance problems... so I was also not too keen to invest too much time in it... But of course that's exactly what we would need to make a more reasonable comparison to the x64 code: the combination of a second instruction stream and the other stuff for 32-bit. It would also be interesting to see how your code performs on an HT-enabled machine; there should be some huge differences, as you can see in the results table on my webpage. HT also helps the FPU version a lot, because clock for clock these old P4s are quite shitty without HT. I'm really curious about the future implementation where Intel wants to combine Core 2 Duo with HT.
EDIT: I'll post a link to your webpage in another forum to get some results; I already got some from them with big 'machines', and hopefully they have an x64 OS.
Kuemmel 29 Oct 2007, 15:16
Got some results for Xorpd!'s versions on Vista x64 from another forum, quite interesting:
http://forums.2cpu.com/showthread.php?t=76178&page=4
The scaling of both seems very different. I hope to get a Barcelona result in the next weeks... and I will try to do a nice results table some day. Only there seems to be a problem with that 8-core Intel 5310 machine... the results are too low compared to the 2-core CPU tests if one computes the speed per core... any idea? I already asked the guy to check his setup...
Xorpd! 04 Nov 2007, 09:12
Thanks for getting those results, Kümmel. I updated the table on my web page accordingly. The FX-70 results are about the same in terms of iterations per clock cycle per core as the Pentium D which is about what you would expect since both processors can sustain no more than 2 double precision flops per clock cycle.
I agree that the Xeon results look suspect -- I would guess that he doesn't have one of the cores enabled on the x64 tests. It seems this is a common problem. If you could get DaveB to fix that somehow and run all 10 or 11 benchmarks again, his system should end up seriously smoking everything we have seen to date...
Copyright © 1999-2023, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.