flat assembler
Message board for the users of flat assembler.

Index > Main > A few questions (cache line optimization)

Author
Thread Post new topic Reply to topic
mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 20 May 2006, 02:31
I'm writing the Leibniz theory of Pi in assembly, and I have a few questions.

First off, the Leibniz theory of Pi is this:
Pi / 4 = 1/1 - 1/3 + 1/5 - 1/7 + 1/9 ....
The sign is reversed each time, and the denominator is incremented by two.

Here's the code:

Code:
format PE
entry start

section '.code' code readable executable

start:
        finit
        fild [two]
        fild [one]
        fild [one]
        fild [one]

top:
        fadd st0,st3
        fxch st1
        fchs
        fld st0
        fdiv st0, st2
        fadd st3,st0
        fstp st0
        fxch st1
        
        jmp top


section '.data' data readable writable

one dd 1
two dd 2    


First question: See anything that I might be able to improve upon?

Second question: For cache line purposes, I believe I should unroll the loop to fill 64 bytes, since I've using a Sempron. The loop, excluding the jmp instruction, is 16 bytes. Should I unroll it three times, and kill off the next 14 bytes with (f)nops? Would it be more efficient to keep the jmp instruction in the same cache line as the rest of the loop? If I moved it to the next cache line, I'd be able to unroll the loop 4 times, while keeping it in the same cache line I can only unroll it 3 times. Which would be best?

Thank you in advance.
Post 20 May 2006, 02:31
View user's profile Send private message Visit poster's website Reply with quote
r22



Joined: 27 Dec 2004
Posts: 805
r22 20 May 2006, 19:00
The label that your jmp points too should be aligned.

align 16
top:
...loop code ...
jmp top

You might also have a SLIGHT problem in that the code you posted does not terminate Razz You may want to put a counter in there and only let it iterate a FINITE amount of time.

As for unrolling, you can probably get up to a 20% speed increase if you figure out the optimal amount to unroll the code. When dealing with cache optimizations the best way to find the correct amount is to test the possibilities with benchmarking.

As for other speed optimizations, all I can say is use SSE/2 if your processor supports it. Double precision should be good enough compared to the 80bit precision of the FPU code. SSk.
E code runs much faster then the FPU code and can also be made parallel for an even greater optimization.

Here's a code snippet of a possible SIMD (sse/2) implementation. Hopefully it's correct enough to give you a start.
Code:
.code
Start:
  movdqa xmm1, dqword[first]
  movdqa xmm2, dqword[divover]
  movdqa xmm3, dqword[add4]
  movdqa xmm4,xmm2 ;;copy
  movdqa xmm0, dqword[zero]
  mov ecx, 12; keep it a multiple of how many X you unroll
;;;;4X unroll 12/4 = 3 no remainder good
Top:
  divpd xmm2, xmm1 ;; 1 / 1  -1/3 etc
  addpd xmm1,xmm3 ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
  addpd xmm0,xmm2 ;; results positives | negatives
  divpd xmm2, xmm1 ;; 1 / 1  -1/3 etc
  addpd xmm1,xmm3 ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
  addpd xmm0,xmm2 ;; results positives | negatives
  divpd xmm2, xmm1 ;; 1 / 1  -1/3 etc
  addpd xmm1,xmm3 ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
  addpd xmm0,xmm2 ;; results positives | negatives
  divpd xmm2, xmm1 ;; 1 / 1  -1/3 etc
  addpd xmm1,xmm3 ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
  addpd xmm0,xmm2 ;; results positives | negatives
  sub ecx,4 ;;decrement our counter
  jns Top ;; while ECX is >= 0 repeat
;;now we need to calculate our result
  movhlpd xmm1, xmm0 ;; moves the negative sums to xmm1
  addsd xmm0,xmm1 ;;our result is now in the low qword of xmm0
  movsd qword[Pi4Answer], xmm0
...
.data
Pi4Answer dq 0
align 16
add4: dq 4.0, 4.0
first: dq 1.0, 3.0 ;;first two divisors
divover: dq 1.0, -1.0 ;; because sign switches
zero: dq 0.0, 0.0
    
Post 20 May 2006, 19:00
View user's profile Send private message AIM Address Yahoo Messenger Reply with quote
mattst88



Joined: 12 May 2006
Posts: 260
Location: South Carolina
mattst88 23 May 2006, 16:06
Wow, great stuff. Thank you very much. Smile

Edit: One thing: I'm trying to assemble your code with FASM 1.66, and on 'movhlpd' it's returning 'illegal instruction'. Google doesn't have much information on this either. Most bizarrely, my intel manuals don't have this instruction either. I understand what the instruction does, but how can I make FASM recognize it?
Post 23 May 2006, 16:06
View user's profile Send private message Visit poster's website Reply with quote
donkey7



Joined: 31 Jan 2005
Posts: 127
Location: Poland, Malopolska
donkey7 23 May 2006, 18:23
if you're going to speed up your program then maybe look at algorithm improvements. on wikipedia are many better algos that will compute pi value faster. but if you're sticking with this algorithm, you also can do some speedups. we can do some mathematical transformations (example limited to 11, but you can extend it if you want):
Code:
1/1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 =
(3-1)/(1*3) + (7-5)/(5*7) + (11-9)/(9*11) =
2/(1*3) + 2/(5*7) + 2/(9*11) =
pi/4

1/(1*3) + 1/(5*7) + 1/(9*11) =
(5*7*9*11 + 1*3*9*11 + 1*3*5*7) / (1*3*5*7*9*11) =
pi/8
    

as you can see, there are only one division left. it should be an considerable improvement. note: i'm not sure this transformations are right Smile and you should also use 80 bit because on 64 bit you can quickly get an overflow.
Post 23 May 2006, 18:23
View user's profile Send private message Visit poster's website Reply with quote
Display posts from previous:
Post new topic Reply to topic

Jump to:  


< Last Thread | Next Thread >
Forum Rules:
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
You cannot attach files in this forum
You can download files in this forum


Copyright © 1999-2025, Tomasz Grysztar. Also on GitHub, YouTube.

Website powered by rwasa.