flat assembler
Message board for the users of flat assembler.
![]() Goto page Previous 1, 2, 3 ... 8, 9, 10 ... 18, 19, 20 Next |
Author |
|
revolution 27 Dec 2007, 23:42
Xorpd!: Try 7-zip, it supports nearly all compression formats and is free open source.
|
|||
![]() |
|
Xorpd! 28 Dec 2007, 04:02
I persuaded a friend to unrar it for me. On my Core 2 Duo E6700 I got 1155.268 for 0.53B compared to the 337.049 for FPU and 738.218 for SSE2 from 0.53 and 809.218 for 0.57_X1 and 1190.606 for 0.57_X2 all in millions of iterations per second on xp-x64.
Taking a peek at your code, I see that you haven't yet created a separate exit for the 4 different numbers being iterated. As remarked earlier (argh, may have been in the lost post) this makes a significant difference. It's ugly coding, but worth it and by doing so you should end up with faster code than my X2 version. Also it seems that you are adding up the iterations the old way and doing too many movs. Have you compared how my X2 version is doing these things? You are getting a good start and let us know when you have made more progress. |
|||
![]() |
|
f0dder 28 Dec 2007, 12:01
Xorpd!: www.rarlabs.com , check their extras page, and grab an unrar version that's suitable for you - it's free and has source available.
|
|||
![]() |
|
Kuemmel 30 Dec 2007, 18:40
Xorpd! wrote: Taking a peek at your code, I see that you haven't yet created a separate exit for the 4 different numbers being iterated. As remarked earlier (argh, may have been in the lost post) this makes a significant difference. It's ugly coding, but worth it and by doing so you should end up with faster code than my X2 version. Hi Xoprd!, yeah, I still need to optimise the adding thing. But your first comment is really interesting, I think I totally missed out that point before and now I also got an answer from Paul Gentieu, the author of Quickman, he's actually doing the same and I just didn't get it...here is some of his mail: --->>>"The code does leave the loop when one of the points diverged or the iteration maximum was reached. BUT, it maintains the state of the other three points (in the point structures). So, the next time the loop is called, those other three points start iterating again from their previous state, along with a new point. Basically, any points that are still in progress when the loop exits will keep iterating the next time the loop is called. They'll stay in the point state structures until they're done. In this way, the code is always operating at max efficiency- four points are always iterating at once. Those points might be in totally different locations on the screen, so the point structure also has a field for the X/Y location of the point. A point gets "retired" (plotted on the screen and removed from the state structure) when it either diverged or reached max iterations."<<<--- So that's really a topic to address, I tried to see how much could be the gain by adding the iterations and it's definitly somthing like a 10% gain or more. Hope to add this soon in the near future ! Thanks for all the comments...I tend to use RAR as it's more efficient than ZIP. Can you explain, how you did integrate the approach from Paul in KMB, as I think you still keep the line based approach, like feeding different lines of the screen to the different threads ? |
|||
![]() |
|
Xorpd! 31 Dec 2007, 07:13
Separate exits are implemented in the X1 and X4 versions. The X4 code is much more complicated because it also unrolls the loops so that it only tests for exit conditions every fourth iteration. When an exit condition is encountered in the X4 version the last four iterations must be replayed to determine the exact point of departure. The X1 version is not nearly as complicated and so may be more useful to compare how separate exits can be implemented within the framework of your benchmark which is quite different from the quickman framework. It's hard to give an explanation of exactly how I do it. Mostly it's just dogged implementation of the necessary program logic. I did get several versions of unrar so I will be able to decode .rar files until I forget how again.
|
|||
![]() |
|
Kuemmel 31 Dec 2007, 08:53
@Xoprd! - Regarding your hint about the iteration adding, I think you mean the the code lines like:
Code: cmplepd xmm2, xmm7 ... psubd xmm6, xmm2 Code: psubd [mem128bit], xmm2 |
|||
![]() |
|
Xorpd! 31 Dec 2007, 19:00
Assuming that one uses one xmm register to hold the real parts of the pair being iterated, another to hold the imaginary parts, and a third to hold the iteration counter, I count three presistent registers per instruction stream. Since you are maintaining two instruction streams, that's 6 registers leaving 2 scratch registers. If your code completes all operations for one pair as a block before doing the same for the other pair, those two scratch registers should be plenty.
If yo examine my X2 version, the contents of xmm5, xmm7, xmm4, xmm8, and xmm13 could have been left in memory and xmm11, xmm12, and xmm14 were never used in the iteration loop, so I could have gotten by with only 8 xmm registers the way that loop is structured. That would have meant 4 loads per instruction stream per iteration, but there are already 4 mulpd and one movmskpd for 5 port 0 instructions as well as 4 add/subpd and one cmplepd for 5 port 1 instructions (on Core 2 Duo, according to Agner Fog) so that the extra port 2 instructions shouldn't be an issue. Keep on plugging away at this. It may prove advisable to construct an X1 version that embodies all the features you are trying to incorporate except for multiple instruction streams because it can be much easier to debug. Once your are sure that your ideas have been correctly implemented in the X1 version you can go on to code them into the X2 version with confidence. With each additional feature the usefulness of this approach will become more evident. |
|||
![]() |
|
Kuemmel 03 Jan 2008, 22:40
Xorpd! wrote: ...Since you are maintaining two instruction streams, that's 6 registers leaving 2 scratch registers. If your code completes all operations for one pair as a block before doing the same for the other pair, those two scratch registers should be plenty... Yeah, this was really my concern, I thought it's needed to have totally independant operation blocks where are no common registers, doens't seem to be needed...so I just did as you said mainly by using your kind of basic iteration instruction flow and combinded it with two different exits and iterating only the needed instruction flow on to the end also using the iteration counting like you did saving memory access. I attached the file here and the exe. Unfortunatelly I couldn't run it this time on a Core 2 Duo (someone ?), but I'm quite confident, as my old development AMD single core Sempron notebook also showed better results, advancing from 162,666 MItPS to 191,156 MItPS. So the gain for Core 2 Duo should be even over my last posted 0.53B-Version, like Xorpd! was assuming also.
|
|||||||||||
![]() |
|
Xorpd! 04 Jan 2008, 00:05
I tried KMB_V0.53C_MT.zip on my Core 2 Duo E6700 (2.67 GHz) xp-x64:
Code: Version MIt/s MKB_V0.53_MT.exe (SSE2) 738.218 MKB_V0.53B_MT.exe 1155.268 MKB_V0.53C_MT.exe 1255.422 Version C was indeed faster, but I can't really see why. 4 points are still iterated until all are complete, only the last pair of points is iterated a little faster because it gets decoupled from the first pair. Logically it doesn't make sense to me that this procedure should be that much faster because iterating one pair shouldn't have that much latency than iterating two pairs, and frequently both pairs get done at the same time. I still think you can gain a significant speed advantage by creating 4 exits and loading the next point in line when any exit is reached. In this way there will always be 4 useful iterations in progress until the end of the line is reached. This is what I do in my X1 and X4 versions. It's quite difficult coding, but there is a performance advantage. One bug: the output pop-up still says version B whereas it should say version C. As the coding gets more difficult, the probability of error rises to certainty. Do you have a version that computes, say, the original SSE2 version and compares pixel by pixel with the latest version? It took me a couple of days to get my X4 version to match my X1 version like this. Looking at my X4 version, you can see why. It also may be useful to create a version that outputs the total iteration counts, so you know there is not an error in keeping track of that number. You may even be able to create an X3 version if you keep the iteration counters in memory. Updating the iteration counters would take three instructions in that case unless you wanted to maintain the additive inverses of the iteration counts, in which case it would take 2 instructions. There are so many possibilities... I hope you continue to make progress. |
|||
![]() |
|
Xorpd! 04 Jan 2008, 00:31
Quote:
Whoops. Of course you implemented cleaned-up code in your inner loop. Definitely that makes a difference. Having two exit points rather than one probably doesn't help much on Core 2 Duo. Going to 4 exit points, as I have said before, should help significantly and as I see it is the main reason for the difference in single-core performance between your code and quickman. |
|||
![]() |
|
asmfan 04 Jan 2008, 07:53
Proposal - please name versions as x32 and x64. And btw 2 Kuemmel please correct includes on your site. In DD Interface - Lock(Surface) and UnLock(Surface) (without or with Surface) - I saw both versions of your sources with "-Surface" and without.
|
|||
![]() |
|
Kuemmel 05 Jan 2008, 15:41
@asmfan: Can imagine people getting confused, so I just updated my webpage with the 0.53C-Version and the corresponding INCLUDE and results. May be somebody with more OS experience can fix the code so that it assembles with the basic standard FASM INCLUDE directory. I'm a little lazy and totally unexperienced when it comes to this.
I also choose now to name it like: "KMB V 0.53C-32b-MT". May be cryptic, but it sounds logic to me. Xorpd! wrote: Going to 4 exit points, as I have said before, should help significantly and as I see it is the main reason for the difference in single-core performance between your code and quickman. Yeah, yet I'm still far from him on my AMD Sempron, about 191 MItPS compared to about 303 MItPS of his Intel code Version. His advantage on single core at the moment still remains 1) Loop unroling 1 time and 2) *always* 4 points iteration 3) Global variables (he says it helps lot...). I think when going to 4 exits it'll help of course. Just then I got to iterate at some iteration count one point only, so it's slower than his code. I see that ADD/SUB/MOV-SD aren't any faster on Core 2 Duo than -PD instructions from the Latencies table. So the gain lies in the easier coded compare and iteration count code. I'll hope to finish that kind of version soon ! |
|||
![]() |
|
Xorpd! 05 Jan 2008, 18:53
In future versions I'll add the -64b- string to my names, even though it's just extra typing and the rule that I use 64 bit code and everyone else uses 32 bit code has not yet been violated in this thread.
|
|||
![]() |
|
bitRAKE 05 Jan 2008, 20:02
Xorpd! wrote: Global variables, like hash tables, are signs of amateurish code. They are what's stopping that group from multithreading. Forget about it. ![]() |
|||
![]() |
|
f0dder 05 Jan 2008, 20:11
Then you're not thread-safe though.
|
|||
![]() |
|
bitRAKE 05 Jan 2008, 21:27
f0dder wrote: Then you're not thread-safe though. |
|||
![]() |
|
f0dder 05 Jan 2008, 21:31
bitRAKE wrote:
You don't always know the max # of threads though, if you want to dynamically adapt to future processors for things that scales well with parallelization. But okay, supporting a max of, say, 16 threads will last you a few years ![]() _________________ carpe noctem |
|||
![]() |
|
Kuemmel 05 Jan 2008, 22:10
Xorpd! wrote: [*]This is the point of going to 4 exits. You seem to misunderstand what I'm talking about here. Go back to my X1 code and examine it more carefully. You will see that I always iterate two points there, just as I always iterate 8 points in my X4 code. 1) iterating 4 points in one loop (2 instruction flows) -> one diverged -> 2) iterating 3 points in one loop (2 instruction flows) -> one diverged -> 3) iterating 2 points in one loop -> one diverged -> 4) iterating 1 point in one loop -> finish Would mean handling a lot of exits but so at least in 1) and 2) 2 instruction flows can be maintained...may be it's worth a try !? |
|||
![]() |
|
Xorpd! 05 Jan 2008, 22:51
You haven't read my code correctly. Assume in the X1 version that the low qword has diverged. It will get knocked out of the .iteration_loop by the sequence cmp eax, 3/ jne .end_of_iteration. Then after test ecx, ecx/ jns .no_xl_limit, it will jump to .no_xl_limit. From there, test eax, 1/ jz .store_xl jumps to .store_xl. There it plots the low qword pixel. On encountering the sequence cmp dword[rsp+80], WINDOWSIZE/ je done_xl, it will not jump unless the current line of the image has no pixels left to render.
After figuring out whether it's an even or odd point, it loads registers to start the low qword at the next point and then gets to .xl_loaded where the initial imaginary part and the counter is reinitialized. It gets to .no_xl_limit where it must fall through the first test, but assuming the high qword hasn't yet diverged it will jump at the sequence test rcx, rcx/ jns .no_xh_limit and then fall through at test eax, 2/ jz .store_xh. This means it will then jump at jmp .iteration_loop and so resume iterating two points at once. Tough logic to follow, and that's why I said it is ugly code and that you may want to get this working in an X1 version of your own before putting that kind of stuff in an X2 version. Speaking of your X2 version, it occurred to me that you could offload the FADD pipe a little without any unrolling. I tried changing the first part of thread_draw_sse2 procedure from KMB_V0.53C_MT.asm from KMB_V0.53C_MT.zip: Code: align 16 thread_draw_sse2: label .plot_y dword at eax+20 label .ebp dword at eax+12 label .edi dword at eax+8 label .esi dword at eax+4 label .ebx dword at eax label .local_iz0 dqword at esp+0 label .local_rz0_12 dqword at esp+16 label .local_rz0_34 dqword at esp+32 label .local_iter_count_12 dqword at esp+48 label .local_iter_count_34 dqword at esp+64 label .local_offset_0_dz dqword at esp+80 label .local_offset_2dz_2dz dqword at esp+96 label .two dqword at esp+112 label .mov12 dqword at esp+128 label .mov34 dqword at esp+144 label .radiant dqword at esp+160 ; change 112 to 176 in labels below label .rz_low qword at esp +176 label .rz_high qword at esp+8 +176 label .iz_low qword at esp+16 +176 label .dz qword at esp+24 +176 label .iz_temp_fpu qword at esp+32 +176 label .rz_temp_fpu qword at esp+40 +176 label .plot_x dword at esp+48 +176 label .plot_limit dword at esp+52 +176 label .local_iter_count dword at esp+56 +176 label .dummy dword at esp+60 +176 ; change 112 to 176 in labels above lea eax,[esp-4*4] ; register save space ; sub esp, 5*4 + 8*16 + 6*8 + 4*4 + (ALIGNMENT-1) ; expand stack frame sub esp, 5*4 + 12*16 + 6*8 + 4*4 + (ALIGNMENT-1) mov [.ebp],ebp mov [.edi],edi mov [.esi],esi mov [.ebx],ebx and esp,0-ALIGNMENT ; generate two mov edi, 2 cvtsi2sd xmm0, edi ; low qword of xmm0 = 2.0 = 4000000000000000h ;movddup xmm0, xmm0 ; maybe this won't work because it's SSE3, so... movlhps xmm0, xmm0 movaps [.two], xmm0 ; get local copy of radiant movaps xmm0, dqword[radiant] movaps [.radiant], xmm0 mov edi, [frame_counter] ; "copy the desired data for the frame lea edi,[edi*3] ; "address = 3 * 8 * framecounter fld qword [frameconfig + 0 + edi*8] fld qword [frameconfig + 8 + edi*8] fld qword [frameconfig + 16 + edi*8] fstp [.iz_low] fstp [.rz_high] fstp [.rz_low] fld [.rz_high] ; dz = abs(rz_high - rz_low)/screen_size fsub [.rz_low] fabs fidiv [window_size] fstp [.dz] xorpd xmm0, xmm0 movhpd xmm0, [.dz] ; xmm2: 0 | dz (moves dz in high QW, low QW is unchanged) movapd [.local_offset_0_dz],xmm0 shufpd xmm0, xmm0, 3 addpd xmm0, xmm0 movapd [.local_offset_2dz_2dz],xmm0 mov [.local_iter_count],0 mov edx, Colour_Table ; pointer to colour table .main_loop: mov ebx, [ddsd.lpSurface] ; get address to surface add ebx, (SCREENWIDTH-WINDOWSIZE)/2*4 ; center horizontally ; Calculate scanline offset to draw to mov esi, [ddsd.lPitch] mov edi, [.plot_y] imul edi, esi add ebx, edi sub esi, WINDOWSIZE*4 ; center horizontally ; calculate plotlimit mov edi, [.plot_y] add edi, LINE_INTERLEAVE mov [.plot_limit], edi .y_loop: ; iz = dz / screen_size * plot_y + iz_low -> better precision than adding... fld [.dz] fimul [.plot_y] fadd [.iz_low] fstp [.iz_temp_fpu] movlpd xmm0, [.iz_temp_fpu] shufpd xmm0, xmm0, 0 ; xmm0: iz | iz movapd [.local_iz0],xmm0 ; store for later usage to free another xmm reg mov [.plot_x], 0 .x_loop: xorpd xmm4, xmm4 ; clear xmm4 xorpd xmm6, xmm6 ; clear_local_iteration_counter pixels _1_2_ xorpd xmm7, xmm7 ; clear_local_iteration_counter pixels _3_4_ fld [.dz] fimul [.plot_x] fadd [.rz_low] fstp [.rz_temp_fpu] movlpd xmm4, [.rz_temp_fpu] ; xmm0: rz | 0 (moves rz to low QW) shufpd xmm4, xmm4, 0 ; xmm0: rz | rz (because of '0' the low QW is moved to high QW) addpd xmm4, [.local_offset_0_dz] ; xmm0: rz | rz + dz for _1_2_ movapd [.local_rz0_12],xmm4 addpd xmm4, [.local_offset_2dz_2dz] movapd [.local_rz0_34],xmm4 movapd xmm0, [.local_rz0_12] ; INIT xmm0: rz | rz + dz movapd xmm1, [.local_iz0] ; INIT xmm1: iz | iz movapd xmm5, [.local_iz0] ; INIT xmm5: iz | iz mov ecx, ITERATIONS align 16 .iteration_loop: ; 1 | 2 movaps xmm3, [.two] ; 2.0 | 2.0 mulpd xmm3, xmm1 ; 2*iz | 2*iz mulpd xmm1, xmm1 ; iz^2 | iz^2 mulpd xmm3, xmm0 ; 2*rz*iz | 2*rz*iz mulpd xmm0, xmm0 ; rz^2 | rz^2 movaps [.mov12], xmm0 ; rz^2 | rz^2 subpd xmm0, xmm1 ; rz^2-iz^2 | rz^2-iz^2 addpd xmm1, [.mov12] ; rz^2+iz^2 | rz^2+iz^2 cmplepd xmm1, [.radiant] addpd xmm0, [.local_rz0_12] ; new rz movmskpd edi, xmm1 psubd xmm6, xmm1 movaps xmm1, [.local_iz0] addpd xmm1, xmm3 ; new iz ; movaps xmm3, xmm1 ;iz | iz ; movaps xmm3, xmm1 ;iz | iz ; addpd xmm1, xmm1 ;2*iz | 2*iz ; mulpd xmm3, xmm3 ;iz^2 | iz^2 ; mulpd xmm1, xmm0 ;2*iz*rz | 2*iz*rz ; mulpd xmm0, xmm0 ;rz^2 | (rz+dz)^2 ; movaps xmm2, xmm0 ;rz^2 | (rz+dz)^2 ; subpd xmm0, xmm3 ;rz^2-iz^2 | (rz+dz)^2-iz^2 ; addpd xmm1, [.local_iz0] ;2*iz*rz+dz | 2*iz*rz+dz ; addpd xmm2, xmm3 ;rz^2+iz^2 | (rz+dz)^2-iz^2 ; ; cmplepd xmm2, dqword[radiant] ; <= 4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h ; addpd xmm0, [.local_rz0_12] ; rz^2 - iz^2 + rz0 | (rz-+dz)^2-iz^2+rz0 ; movmskpd edi, xmm2 ; get the sign bits of the two QW in xmm2 in eax -> so either 00,01,10,11 ; psubd xmm6, xmm2 ; add to iteration count _1_2_(FFFFFFFFFFFFFFFFh = -1 !!!) ; 3 | 4 movaps xmm3, [.two] ; 2.0 | 2.0 mulpd xmm3, xmm5 ; 2*iz | 2*iz mulpd xmm5, xmm5 ; iz^2 | iz^2 mulpd xmm3, xmm4 ; 2*rz*iz | 2*rz*iz mulpd xmm4, xmm4 ; rz^2 | rz^2 movaps [.mov34], xmm4 ; rz^2 | rz^2 subpd xmm4, xmm5 ; rz^2-iz^2 | rz^2-iz^2 addpd xmm5, [.mov34] ; rz^2+iz^2 | rz^2+iz^2 cmplepd xmm5, [.radiant] addpd xmm4, [.local_rz0_34] ; new rz movmskpd ebp, xmm5 psubd xmm7, xmm5 movaps xmm5, [.local_iz0] addpd xmm5, xmm3 ; new iz ; movaps xmm3, xmm5 ;iz | iz ; addpd xmm5, xmm5 ;2*iz | 2*iz ; mulpd xmm3, xmm3 ;iz^2 | iz^2 ; mulpd xmm5, xmm4 ;2*iz*rz | 2*iz*rz ; mulpd xmm4, xmm4 ;rz^2 | (rz+dz)^2 ; movaps xmm2, xmm4 ;rz^2 | (rz+dz)^2 ; subpd xmm4, xmm3 ;rz^2-iz^2 | (rz+dz)^2-iz^2 ; addpd xmm5, [.local_iz0] ;2*iz*rz+dz | 2*iz*rz+dz ; addpd xmm2, xmm3 ;rz^2+iz^2 | (rz+dz)^2-iz^2 ; ; cmplepd xmm2, dqword[radiant] ; <= 4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h ; addpd xmm4, [.local_rz0_34] ; rz^2 - iz^2 + rz0 | (rz-+dz)^2-iz^2+rz0 ; movmskpd ebp, xmm2 ; get the sign bits of the two QW in xmm2 in eax -> so either 00,01,10,11 ; psubd xmm7, xmm2 ; add to iteration count _3_4_(FFFFFFFFFFFFFFFFh = -1 !!!) test edi,edi jz .continue_only_with_34 ; as it says... test ebp,ebp jz .continue_only_with_12 ; as it says... sub ecx, 1 jnz .iteration_loop .end_of_iteration: It made the program about 1% slower on Core 2 Duo, but maybe it helps on your Sempron. Let me know how it does on your machine. Reading the documentation a litttle more, I can see that unrolling should indeed be an important optimization step on your Sempron. |
|||
![]() |
|
Goto page Previous 1, 2, 3 ... 8, 9, 10 ... 18, 19, 20 Next < Last Thread | Next Thread > |
Forum Rules:
|
Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.