flat assembler
Message board for the users of flat assembler.

Index > Windows > Mandelbrot Benchmark FPU/SSE2 released

Goto page Previous  1, 2, 3 ... 8, 9, 10 ... 18, 19, 20  Next
Author
Thread Post new topic Reply to topic
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 27 Dec 2007, 23:27
@Kümmel: I wanted to take a look at KMB_V0.53B_MT.rar, but I don't have a .rar program and such programs on the web either wanted money or to install adware. Would it be possible to post a .ZIP version?

@Madis731: I tried "back", even looked around in temporary internet files. There was no possibility to reconstruct it, and it took so much time to collect all those URLs. When a web page has eaten your input, "back" never works, at least not for me. I guess the answer is to make shorter and less substantive posts.
Post 27 Dec 2007, 23:27
View user's profile Send private message Visit poster's website Reply with quote
revolution
When all else fails, read the source


Joined: 24 Aug 2004
Posts: 20632
Location: In your JS exploiting you and your system
revolution 27 Dec 2007, 23:42
Xorpd!: Try 7-zip, it supports nearly all compression formats and is free open source.
Post 27 Dec 2007, 23:42
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 28 Dec 2007, 04:02
I persuaded a friend to unrar it for me. On my Core 2 Duo E6700 I got 1155.268 for 0.53B compared to the 337.049 for FPU and 738.218 for SSE2 from 0.53 and 809.218 for 0.57_X1 and 1190.606 for 0.57_X2 all in millions of iterations per second on xp-x64.

Taking a peek at your code, I see that you haven't yet created a separate exit for the 4 different numbers being iterated. As remarked earlier (argh, may have been in the lost post) this makes a significant difference. It's ugly coding, but worth it and by doing so you should end up with faster code than my X2 version.

Also it seems that you are adding up the iterations the old way and doing too many movs. Have you compared how my X2 version is doing these things? You are getting a good start and let us know when you have made more progress.
Post 28 Dec 2007, 04:02
View user's profile Send private message Visit poster's website Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 28 Dec 2007, 12:01
Xorpd!: www.rarlabs.com , check their extras page, and grab an unrar version that's suitable for you - it's free and has source available.
Post 28 Dec 2007, 12:01
View user's profile Send private message Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 30 Dec 2007, 18:40
Xorpd! wrote:
Taking a peek at your code, I see that you haven't yet created a separate exit for the 4 different numbers being iterated. As remarked earlier (argh, may have been in the lost post) this makes a significant difference. It's ugly coding, but worth it and by doing so you should end up with faster code than my X2 version.

Also it seems that you are adding up the iterations the old way and doing too many movs. Have you compared how my X2 version is doing these things? You are getting a good start and let us know when you have made more progress.

Hi Xoprd!, yeah, I still need to optimise the adding thing. But your first comment is really interesting, I think I totally missed out that point before and now I also got an answer from Paul Gentieu, the author of Quickman, he's actually doing the same and I just didn't get it...here is some of his mail:

--->>>"The code does leave the loop when one of the points diverged or the iteration maximum was reached. BUT, it maintains the state of the other three points (in the point structures). So, the next time the loop is called, those other three points start iterating again from their previous state, along with a new point.

Basically, any points that are still in progress when the loop exits will keep iterating the next time the loop is called. They'll stay in the point state structures until they're done.

In this way, the code is always operating at max efficiency- four points are always iterating at once. Those points might be in totally different locations on the screen, so the point structure also has a field for the X/Y location of the point. A point gets "retired" (plotted on the screen and removed from the state structure) when it either diverged or reached max iterations."<<<---

So that's really a topic to address, I tried to see how much could be the gain by adding the iterations and it's definitly somthing like a 10% gain or more. Hope to add this soon in the near future ! Thanks for all the comments...I tend to use RAR as it's more efficient than ZIP.

Can you explain, how you did integrate the approach from Paul in KMB, as I think you still keep the line based approach, like feeding different lines of the screen to the different threads ?
Post 30 Dec 2007, 18:40
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 31 Dec 2007, 07:13
Separate exits are implemented in the X1 and X4 versions. The X4 code is much more complicated because it also unrolls the loops so that it only tests for exit conditions every fourth iteration. When an exit condition is encountered in the X4 version the last four iterations must be replayed to determine the exact point of departure. The X1 version is not nearly as complicated and so may be more useful to compare how separate exits can be implemented within the framework of your benchmark which is quite different from the quickman framework. It's hard to give an explanation of exactly how I do it. Mostly it's just dogged implementation of the necessary program logic. I did get several versions of unrar so I will be able to decode .rar files until I forget how again.
Post 31 Dec 2007, 07:13
View user's profile Send private message Visit poster's website Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 31 Dec 2007, 08:53
@Xoprd! - Regarding your hint about the iteration adding, I think you mean the the code lines like:
Code:
cmplepd xmm2, xmm7
...
psubd xmm6, xmm2    
...where xmm6 is the iteration counter and as one or both half of xmm2 is -1 if an iteration needs to be added it works with the substraction, nice idea...the problem is that I don't got any "free" registers any more, so I would need something like
Code:
psubd [mem128bit], xmm2    
what does not exist and I think also not for any adding instruction, only MOV...hm...any other usefull SSE2 instruction that I missed and would do the task or any workaround ? Lack of registers really sucks...
Post 31 Dec 2007, 08:53
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 31 Dec 2007, 19:00
Assuming that one uses one xmm register to hold the real parts of the pair being iterated, another to hold the imaginary parts, and a third to hold the iteration counter, I count three presistent registers per instruction stream. Since you are maintaining two instruction streams, that's 6 registers leaving 2 scratch registers. If your code completes all operations for one pair as a block before doing the same for the other pair, those two scratch registers should be plenty.

If yo examine my X2 version, the contents of xmm5, xmm7, xmm4, xmm8, and xmm13 could have been left in memory and xmm11, xmm12, and xmm14 were never used in the iteration loop, so I could have gotten by with only 8 xmm registers the way that loop is structured. That would have meant 4 loads per instruction stream per iteration, but there are already 4 mulpd and one movmskpd for 5 port 0 instructions as well as 4 add/subpd and one cmplepd for 5 port 1 instructions (on Core 2 Duo, according to Agner Fog) so that the extra port 2 instructions shouldn't be an issue.

Keep on plugging away at this. It may prove advisable to construct an X1 version that embodies all the features you are trying to incorporate except for multiple instruction streams because it can be much easier to debug. Once your are sure that your ideas have been correctly implemented in the X1 version you can go on to code them into the X2 version with confidence. With each additional feature the usefulness of this approach will become more evident.
Post 31 Dec 2007, 19:00
View user's profile Send private message Visit poster's website Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 03 Jan 2008, 22:40
Xorpd! wrote:
...Since you are maintaining two instruction streams, that's 6 registers leaving 2 scratch registers. If your code completes all operations for one pair as a block before doing the same for the other pair, those two scratch registers should be plenty...

Yeah, this was really my concern, I thought it's needed to have totally independant operation blocks where are no common registers, doens't seem to be needed...so I just did as you said mainly by using your kind of basic iteration instruction flow and combinded it with two different exits and iterating only the needed instruction flow on to the end also using the iteration counting like you did saving memory access.

I attached the file here and the exe. Unfortunatelly I couldn't run it this time on a Core 2 Duo (someone ?), but I'm quite confident, as my old development AMD single core Sempron notebook also showed better results, advancing from 162,666 MItPS to 191,156 MItPS. So the gain for Core 2 Duo should be even over my last posted 0.53B-Version, like Xorpd! was assuming also.


Description:
Download
Filename: KMB_V0.53C_MT.zip
Filesize: 10.79 KB
Downloaded: 301 Time(s)

Post 03 Jan 2008, 22:40
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 04 Jan 2008, 00:05
I tried KMB_V0.53C_MT.zip on my Core 2 Duo E6700 (2.67 GHz) xp-x64:
Code:
      Version             MIt/s
MKB_V0.53_MT.exe (SSE2)  738.218
MKB_V0.53B_MT.exe       1155.268
MKB_V0.53C_MT.exe       1255.422
    

Version C was indeed faster, but I can't really see why. 4 points are still iterated until all are complete, only the last pair of points is iterated a little faster because it gets decoupled from the first pair. Logically it doesn't make sense to me that this procedure should be that much faster because iterating one pair shouldn't have that much latency than iterating two pairs, and frequently both pairs get done at the same time.

I still think you can gain a significant speed advantage by creating 4 exits and loading the next point in line when any exit is reached. In this way there will always be 4 useful iterations in progress until the end of the line is reached. This is what I do in my X1 and X4 versions. It's quite difficult coding, but there is a performance advantage.

One bug: the output pop-up still says version B whereas it should say version C. As the coding gets more difficult, the probability of error rises to certainty. Do you have a version that computes, say, the original SSE2 version and compares pixel by pixel with the latest version? It took me a couple of days to get my X4 version to match my X1 version like this. Looking at my X4 version, you can see why. It also may be useful to create a version that outputs the total iteration counts, so you know there is not an error in keeping track of that number.

You may even be able to create an X3 version if you keep the iteration counters in memory. Updating the iteration counters would take three instructions in that case unless you wanted to maintain the additive inverses of the iteration counts, in which case it would take 2 instructions. There are so many possibilities... I hope you continue to make progress.
Post 04 Jan 2008, 00:05
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 04 Jan 2008, 00:31
Quote:

Version C was indeed faster, but I can't really see why.

Whoops. Of course you implemented cleaned-up code in your inner loop. Definitely that makes a difference. Having two exit points rather than one probably doesn't help much on Core 2 Duo. Going to 4 exit points, as I have said before, should help significantly and as I see it is the main reason for the difference in single-core performance between your code and quickman.
Post 04 Jan 2008, 00:31
View user's profile Send private message Visit poster's website Reply with quote
asmfan



Joined: 11 Aug 2006
Posts: 392
Location: Russian
asmfan 04 Jan 2008, 07:53
Proposal - please name versions as x32 and x64. And btw 2 Kuemmel please correct includes on your site. In DD Interface - Lock(Surface) and UnLock(Surface) (without or with Surface) - I saw both versions of your sources with "-Surface" and without.
Post 04 Jan 2008, 07:53
View user's profile Send private message Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 05 Jan 2008, 15:41
@asmfan: Can imagine people getting confused, so I just updated my webpage with the 0.53C-Version and the corresponding INCLUDE and results. May be somebody with more OS experience can fix the code so that it assembles with the basic standard FASM INCLUDE directory. I'm a little lazy and totally unexperienced when it comes to this.

I also choose now to name it like: "KMB V 0.53C-32b-MT". May be cryptic, but it sounds logic to me.

Xorpd! wrote:
Going to 4 exit points, as I have said before, should help significantly and as I see it is the main reason for the difference in single-core performance between your code and quickman.

Yeah, yet I'm still far from him on my AMD Sempron, about 191 MItPS compared to about 303 MItPS of his Intel code Version. His advantage on single core at the moment still remains 1) Loop unroling 1 time and 2) *always* 4 points iteration 3) Global variables (he says it helps lot...).

I think when going to 4 exits it'll help of course. Just then I got to iterate at some iteration count one point only, so it's slower than his code. I see that ADD/SUB/MOV-SD aren't any faster on Core 2 Duo than -PD instructions from the Latencies table. So the gain lies in the easier coded compare and iteration count code. I'll hope to finish that kind of version soon !
Post 05 Jan 2008, 15:41
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 05 Jan 2008, 18:53
In future versions I'll add the -64b- string to my names, even though it's just extra typing and the rule that I use 64 bit code and everyone else uses 32 bit code has not yet been violated in this thread.

  1. Loop unrolling should help more on Sempron than Core 2 Duo because the former is more nearly throughput-limited even at the X2 stage. I have noticed that MOVMSKPD is FADD on Sempron whereas it is port 0 (the multiplication port) on Core 2 Duo, see instruction_tables.pdf. Thus the sequence with movaps xmm3, xmm1/ addpd xmm1, xmm1/ mulpd xmm3, xmm3 is bad because it uses the FADD pipe unnecessarily. Better would be movaps xmm3, [two]/ mulpd xmm3, xmm1/ mulpd xmm1, xmm1 because the 'optional' operation goes through the less cluttered FMUL pipe. Also it avoids a problem seen in section 9.7 of microarchitecture.pdf where an operation of unusual latency is placed in the dependency path. Unrolling is needed for this optimization because it moves the imaginary part away from its original register.
  2. This is the point of going to 4 exits. You seem to misunderstand what I'm talking about here. Go back to my X1 code and examine it more carefully. You will see that I always iterate two points there, just as I always iterate 8 points in my X4 code.
  3. Global variables, like hash tables, are signs of amateurish code. They are what's stopping that group from multithreading. Forget about it.
Post 05 Jan 2008, 18:53
View user's profile Send private message Visit poster's website Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4227
Location: vpcmpistri
bitRAKE 05 Jan 2008, 20:02
Xorpd! wrote:
Global variables, like hash tables, are signs of amateurish code. They are what's stopping that group from multithreading. Forget about it.
In the rare case global variables might gain a register - if you don't mind using ESP for general purpose, lol. Razz
Post 05 Jan 2008, 20:02
View user's profile Send private message Visit poster's website Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 05 Jan 2008, 20:11
Then you're not thread-safe though.
Post 05 Jan 2008, 20:11
View user's profile Send private message Reply with quote
bitRAKE



Joined: 21 Jul 2003
Posts: 4227
Location: vpcmpistri
bitRAKE 05 Jan 2008, 21:27
f0dder wrote:
Then you're not thread-safe though.
True, there would need to be some local link to global data - may as well be a local link to local data. Or, specific code for each thread - we know what the maximum number of threads are and don't need to concern ourselves with the general case - which is not possible.
Post 05 Jan 2008, 21:27
View user's profile Send private message Visit poster's website Reply with quote
f0dder



Joined: 19 Feb 2004
Posts: 3175
Location: Denmark
f0dder 05 Jan 2008, 21:31
bitRAKE wrote:
f0dder wrote:
Then you're not thread-safe though.
True, there would need to be some local link to global data - may as well be a local link to local data. Or, specific code for each thread - we know what the maximum number of threads are and don't need to concern ourselves with the general case - which is not possible.


You don't always know the max # of threads though, if you want to dynamically adapt to future processors for things that scales well with parallelization.

But okay, supporting a max of, say, 16 threads will last you a few years Smile

_________________
carpe noctem
Post 05 Jan 2008, 21:31
View user's profile Send private message Reply with quote
Kuemmel



Joined: 30 Jan 2006
Posts: 200
Location: Stuttgart, Germany
Kuemmel 05 Jan 2008, 22:10
Xorpd! wrote:
[*]This is the point of going to 4 exits. You seem to misunderstand what I'm talking about here. Go back to my X1 code and examine it more carefully. You will see that I always iterate two points there, just as I always iterate 8 points in my X4 code.
Hm, I think I read your code correctly, you are iterating of course 2 points in the main ".iteration_loop". But if one of the points is diverged you are afterwards iterating the non-diverged point in the ".xl_iteration_loop". So in that loop there's only a one-point-iteration taking place, whereas the Quickman author "feeds" always new points (wherever they are screenwise) to the main iteration loop. So that's how one basic idea came to my mind:

1) iterating 4 points in one loop (2 instruction flows) -> one diverged ->
2) iterating 3 points in one loop (2 instruction flows) -> one diverged ->
3) iterating 2 points in one loop -> one diverged ->
4) iterating 1 point in one loop -> finish

Would mean handling a lot of exits but so at least in 1) and 2) 2 instruction flows can be maintained...may be it's worth a try !?
Post 05 Jan 2008, 22:10
View user's profile Send private message Visit poster's website Reply with quote
Xorpd!



Joined: 21 Dec 2006
Posts: 161
Xorpd! 05 Jan 2008, 22:51
You haven't read my code correctly. Assume in the X1 version that the low qword has diverged. It will get knocked out of the .iteration_loop by the sequence cmp eax, 3/ jne .end_of_iteration. Then after test ecx, ecx/ jns .no_xl_limit, it will jump to .no_xl_limit. From there, test eax, 1/ jz .store_xl jumps to .store_xl. There it plots the low qword pixel. On encountering the sequence cmp dword[rsp+80], WINDOWSIZE/ je done_xl, it will not jump unless the current line of the image has no pixels left to render.

After figuring out whether it's an even or odd point, it loads registers to start the low qword at the next point and then gets to .xl_loaded where the initial imaginary part and the counter is reinitialized. It gets to .no_xl_limit where it must fall through the first test, but assuming the high qword hasn't yet diverged it will jump at the sequence test rcx, rcx/ jns .no_xh_limit and then fall through at test eax, 2/ jz .store_xh. This means it will then jump at jmp .iteration_loop and so resume iterating two points at once.

Tough logic to follow, and that's why I said it is ugly code and that you may want to get this working in an X1 version of your own before putting that kind of stuff in an X2 version.

Speaking of your X2 version, it occurred to me that you could offload the FADD pipe a little without any unrolling. I tried changing the first part of thread_draw_sse2 procedure from KMB_V0.53C_MT.asm from KMB_V0.53C_MT.zip:
Code:
align 16
thread_draw_sse2: 
 label .plot_y                           dword at eax+20
     label .ebp                                      dword at eax+12 
    label .edi                                      dword at eax+8 
     label .esi                                      dword at eax+4 
     label .ebx                                      dword at eax 
       
    label .local_iz0                                dqword at esp+0
     label .local_rz0_12                             dqword at esp+16
    label .local_rz0_34                             dqword at esp+32
    label .local_iter_count_12              dqword at esp+48
    label .local_iter_count_34              dqword at esp+64
    label .local_offset_0_dz                dqword at esp+80
    label .local_offset_2dz_2dz             dqword at esp+96
label .two dqword at esp+112
label .mov12 dqword at esp+128
label .mov34 dqword at esp+144
label .radiant dqword at esp+160
; change 112 to 176 in labels below 
    label .rz_low                           qword at esp    +176
        label .rz_high                          qword at esp+8  +176
        label .iz_low                           qword at esp+16 +176
        label .dz                                       qword at esp+24 +176
        label .iz_temp_fpu                      qword at esp+32 +176
        label .rz_temp_fpu                      qword at esp+40 +176

    label .plot_x                           dword at esp+48 +176
        label .plot_limit                       dword at esp+52 +176
        label .local_iter_count         dword at esp+56 +176
        label .dummy                            dword at esp+60 +176
; change 112 to 176 in labels above

     lea eax,[esp-4*4]                                               ; register save space
;      sub esp, 5*4 + 8*16 + 6*8 + 4*4 + (ALIGNMENT-1)
; expand stack frame
     sub esp, 5*4 + 12*16 + 6*8 + 4*4 + (ALIGNMENT-1)
    mov [.ebp],ebp 
     mov [.edi],edi 
     mov [.esi],esi 
     mov [.ebx],ebx 
     and esp,0-ALIGNMENT

; generate two
mov edi, 2
cvtsi2sd xmm0, edi  ; low qword of xmm0 = 2.0 = 4000000000000000h
;movddup xmm0, xmm0 ; maybe this won't work because it's SSE3, so...
movlhps xmm0, xmm0
movaps [.two], xmm0
; get local copy of radiant
movaps xmm0, dqword[radiant]
movaps [.radiant], xmm0
       mov edi, [frame_counter]                            ; "copy the desired data for the frame
     lea edi,[edi*3]                                     ; "address =  3 * 8 * framecounter
 fld qword [frameconfig + 0  + edi*8]
        fld qword [frameconfig + 8  + edi*8]
        fld qword [frameconfig + 16 + edi*8]
        fstp [.iz_low]
      fstp [.rz_high]
     fstp [.rz_low]

  fld  [.rz_high]                                                      ; dz = abs(rz_high - rz_low)/screen_size
       fsub [.rz_low]
      fabs
        fidiv [window_size]
 fstp  [.dz]
 
    xorpd   xmm0, xmm0
  movhpd  xmm0, [.dz]                                                     ; xmm2:    0    |   dz      (moves dz in high QW, low QW is unchanged)
  movapd  [.local_offset_0_dz],xmm0
   shufpd  xmm0, xmm0, 3
       addpd   xmm0, xmm0
  movapd  [.local_offset_2dz_2dz],xmm0

    mov [.local_iter_count],0
   mov edx, Colour_Table                                       ; pointer to colour table

.main_loop:
        mov             ebx, [ddsd.lpSurface]                           ; get address to surface
    add             ebx, (SCREENWIDTH-WINDOWSIZE)/2*4       ; center horizontally

   ; Calculate scanline offset to draw to
      mov             esi, [ddsd.lPitch]
  mov             edi, [.plot_y]
      imul    edi, esi
    add             ebx, edi
    sub             esi, WINDOWSIZE*4                                       ; center horizontally

   ; calculate plotlimit
       mov             edi, [.plot_y]
      add             edi, LINE_INTERLEAVE
        mov             [.plot_limit], edi

.y_loop:
      ; iz = dz / screen_size * plot_y + iz_low -> better precision than adding...
     fld             [.dz]
       fimul   [.plot_y]
   fadd    [.iz_low]
   fstp    [.iz_temp_fpu]
      
    movlpd  xmm0, [.iz_temp_fpu]
        shufpd  xmm0, xmm0, 0                                           ; xmm0:    iz   |  iz
   movapd  [.local_iz0],xmm0                                       ; store for later usage to free another xmm reg
     
    mov             [.plot_x], 0
.x_loop:
        xorpd   xmm4, xmm4                                      ; clear xmm4
        xorpd   xmm6, xmm6                                      ; clear_local_iteration_counter pixels _1_2_
        xorpd   xmm7, xmm7                                      ; clear_local_iteration_counter pixels _3_4_

    fld             [.dz]
       fimul   [.plot_x]
   fadd    [.rz_low]
   fstp    [.rz_temp_fpu]

  movlpd  xmm4, [.rz_temp_fpu]                    ; xmm0:    rz   |   0       (moves rz to low QW)
        shufpd  xmm4, xmm4, 0                                           ; xmm0:    rz   |   rz      (because of '0' the low QW is moved to high QW)
   addpd   xmm4, [.local_offset_0_dz]                      ; xmm0:    rz   |   rz + dz                         for _1_2_
   movapd  [.local_rz0_12],xmm4
        
    addpd   xmm4, [.local_offset_2dz_2dz]
       movapd  [.local_rz0_34],xmm4                            

        movapd  xmm0, [.local_rz0_12]                           ; INIT xmm0:    rz  |   rz + dz
 movapd  xmm1, [.local_iz0]                                      ; INIT xmm1:    iz   |   iz     
        movapd  xmm5, [.local_iz0]                                      ; INIT xmm5:    iz   |   iz 

        mov             ecx, ITERATIONS

align 16
.iteration_loop:                         ;         1     |       2
movaps xmm3, [.two]   ;    2.0    |    2.0
mulpd  xmm3, xmm1     ;   2*iz    |   2*iz
mulpd  xmm1, xmm1     ;   iz^2    |   iz^2
mulpd  xmm3, xmm0     ;  2*rz*iz  |  2*rz*iz
mulpd  xmm0, xmm0     ;   rz^2    |   rz^2
movaps [.mov12], xmm0 ;   rz^2    |   rz^2
subpd  xmm0, xmm1     ; rz^2-iz^2 | rz^2-iz^2
addpd  xmm1, [.mov12] ; rz^2+iz^2 | rz^2+iz^2
cmplepd xmm1, [.radiant]
addpd  xmm0, [.local_rz0_12] ; new rz
movmskpd edi, xmm1
psubd xmm6, xmm1
movaps xmm1, [.local_iz0]
addpd  xmm1, xmm3     ; new iz

;        movaps  xmm3, xmm1                      ;iz             |       iz
; movaps  xmm3, xmm1                      ;iz             |       iz
; addpd   xmm1, xmm1                      ;2*iz           |       2*iz
;       mulpd   xmm3, xmm3                      ;iz^2           |       iz^2
;       mulpd   xmm1, xmm0                      ;2*iz*rz        |       2*iz*rz
;    mulpd   xmm0, xmm0                      ;rz^2           |       (rz+dz)^2
;  movaps  xmm2, xmm0                      ;rz^2           |       (rz+dz)^2
;  subpd   xmm0, xmm3                      ;rz^2-iz^2      |       (rz+dz)^2-iz^2
;     addpd   xmm1, [.local_iz0]      ;2*iz*rz+dz     |       2*iz*rz+dz
; addpd   xmm2, xmm3                      ;rz^2+iz^2      |       (rz+dz)^2-iz^2
;
;        cmplepd         xmm2, dqword[radiant]   ; <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
;      addpd           xmm0, [.local_rz0_12]   ; rz^2 - iz^2 + rz0     |   (rz-+dz)^2-iz^2+rz0
;    movmskpd        edi,  xmm2                              ; get the sign bits of the two QW in xmm2 in eax -> so either 00,01,10,11
;       psubd           xmm6, xmm2                              ; add to iteration count _1_2_(FFFFFFFFFFFFFFFFh = -1 !!!)
                                                          ;         3     |       4
movaps xmm3, [.two]   ;    2.0    |    2.0
mulpd  xmm3, xmm5     ;   2*iz    |   2*iz
mulpd  xmm5, xmm5     ;   iz^2    |   iz^2
mulpd  xmm3, xmm4     ;  2*rz*iz  |  2*rz*iz
mulpd  xmm4, xmm4     ;   rz^2    |   rz^2
movaps [.mov34], xmm4 ;   rz^2    |   rz^2
subpd  xmm4, xmm5     ; rz^2-iz^2 | rz^2-iz^2
addpd  xmm5, [.mov34] ; rz^2+iz^2 | rz^2+iz^2
cmplepd xmm5, [.radiant]
addpd  xmm4, [.local_rz0_34] ; new rz
movmskpd ebp, xmm5
psubd xmm7, xmm5
movaps xmm5, [.local_iz0]
addpd  xmm5, xmm3     ; new iz

;        movaps  xmm3, xmm5                      ;iz             |       iz
; addpd   xmm5, xmm5                      ;2*iz           |       2*iz
;       mulpd   xmm3, xmm3                      ;iz^2           |       iz^2
;       mulpd   xmm5, xmm4                      ;2*iz*rz        |       2*iz*rz 
;   mulpd   xmm4, xmm4                      ;rz^2           |       (rz+dz)^2
;  movaps  xmm2, xmm4                      ;rz^2           |       (rz+dz)^2
;  subpd   xmm4, xmm3                      ;rz^2-iz^2      |       (rz+dz)^2-iz^2
;     addpd   xmm5, [.local_iz0]      ;2*iz*rz+dz     |       2*iz*rz+dz
; addpd   xmm2, xmm3                      ;rz^2+iz^2      |       (rz+dz)^2-iz^2
;
;        cmplepd         xmm2, dqword[radiant]   ; <=  4.0 | 4.0 ? True -> QW = FFFFFFFFFFFFFFFFh else 0000000000000000h
;      addpd           xmm4, [.local_rz0_34]   ; rz^2 - iz^2 + rz0     |   (rz-+dz)^2-iz^2+rz0
;    movmskpd        ebp,  xmm2                              ; get the sign bits of the two QW in xmm2 in eax -> so either 00,01,10,11
;       psubd           xmm7, xmm2                              ; add to iteration count _3_4_(FFFFFFFFFFFFFFFFh = -1 !!!)

      test    edi,edi                         
    jz              .continue_only_with_34          ; as it says...
     test    ebp,ebp                         
    jz              .continue_only_with_12          ; as it says...
     sub             ecx, 1
      jnz             .iteration_loop

.end_of_iteration:
    

It made the program about 1% slower on Core 2 Duo, but maybe it helps on your Sempron. Let me know how it does on your machine. Reading the documentation a litttle more, I can see that unrolling should indeed be an important optimization step on your Sempron.
Post 05 Jan 2008, 22:51
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  
Goto page Previous  1, 2, 3 ... 8, 9, 10 ... 18, 19, 20  Next

< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.