flat assembler
Message board for the users of flat assembler.
A few questions (cache line optimization)

r22 20 May 2006, 19:00
The label that your jmp points to should be aligned.
align 16
top:
    ...loop code...
    jmp top

You might also have a SLIGHT problem in that the code you posted does not terminate. You may want to put a counter in there and only let it iterate a FINITE number of times.

As for unrolling, you can probably get up to a 20% speed increase if you figure out the optimal amount to unroll the code. When dealing with cache optimizations, the best way to find the correct amount is to test the possibilities with benchmarking.

As for other speed optimizations, all I can say is use SSE2 if your processor supports it. Double precision should be good enough compared to the 80-bit precision of the FPU code. SSE code runs much faster than the FPU code and can also be made parallel for an even greater optimization.

Here's a code snippet of a possible SIMD (SSE2) implementation. Hopefully it's correct enough to give you a start.

Code:
.code
Start:
    movdqa xmm1, dqword[first]
    movdqa xmm2, dqword[divover]
    movdqa xmm3, dqword[add4]
    movdqa xmm4, xmm2           ;; copy
    movdqa xmm0, dqword[zero]
    mov ecx, 12                 ;; keep it a multiple of how many X you unroll
                                ;; 4X unroll: 12/4 = 3, no remainder, good
Top:
    divpd xmm2, xmm1            ;; 1 / 1, 1 / 3 etc
    addpd xmm1, xmm3            ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
    addpd xmm0, xmm2            ;; results: positives | negatives
    divpd xmm2, xmm1            ;; 1 / 1, 1 / 3 etc
    addpd xmm1, xmm3            ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
    addpd xmm0, xmm2            ;; results: positives | negatives
    divpd xmm2, xmm1            ;; 1 / 1, 1 / 3 etc
    addpd xmm1, xmm3            ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
    addpd xmm0, xmm2            ;; results: positives | negatives
    divpd xmm2, xmm1            ;; 1 / 1, 1 / 3 etc
    addpd xmm1, xmm3            ;; 1 | 3 becomes 5 | 7 becomes 9 | 11 etc
    addpd xmm0, xmm2            ;; results: positives | negatives
    sub ecx, 4                  ;; decrement our counter
    jns Top                     ;; while ECX is >= 0 repeat
    ;; now we need to calculate our result
    movhlpd xmm1, xmm0          ;; moves the negative sums to xmm1
    addsd xmm0, xmm1            ;; our result is now in the low qword of xmm0
    movsd qword[Pi4Answer], xmm0
    ...

.data
Pi4Answer dq 0
align 16
add4:    dq 4.0, 4.0
first:   dq 1.0, 3.0            ;; first two divisors
divover: dq 1.0, 1.0            ;; because sign switches
zero:    dq 0.0, 0.0


mattst88 23 May 2006, 16:06
Wow, great stuff. Thank you very much.
Edit: One thing: I'm trying to assemble your code with FASM 1.66, and on 'movhlpd' it's returning 'illegal instruction'. Google doesn't have much information on this, and, most bizarrely, my Intel manuals don't have this instruction either. I understand what the instruction does, but how can I make FASM recognize it?
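For reference, here is a sketch of how r22's snippet might be made to assemble and sum the series. It is a starting point under stated assumptions, not a tested answer, and none of the assumptions is confirmed in this thread: movhlps (high qword of the source into the low qword of the destination) stands in for movhlpd; the numerators are reloaded from xmm4 before every divpd, since divpd overwrites xmm2 with the quotient; the high lane of divover is -1.0 so that the final addsd subtracts the 1/3, 1/7, ... terms; the loop ends on jnz so that 12 means exactly 12 divpd steps; and the ELF64 wrapper with its exit syscall assumes Linux x86-64.

```asm
; Sketch only. Assumptions: Linux x86-64, fasm ELF64 output, SSE2 available.
format ELF64 executable 3
entry start

segment readable executable

start:
        movdqa  xmm1, dqword [first]    ; divisors: 1.0 (low) | 3.0 (high)
        movdqa  xmm3, dqword [add4]     ; 4.0 | 4.0, step added to both divisors
        movdqa  xmm4, dqword [divover]  ; numerators: +1.0 | -1.0
        xorpd   xmm0, xmm0              ; running sums: 0.0 | 0.0
        mov     ecx, 12                 ; divpd steps to run, a multiple of 4

align 16                                ; align the loop label
top:
    rept 4 {                            ; 4X unroll, expanded by the assembler
        movdqa  xmm2, xmm4              ; fresh numerators for this step
        divpd   xmm2, xmm1              ; +1/d | -1/(d+2)
        addpd   xmm1, xmm3              ; 1 | 3 becomes 5 | 7, and so on
        addpd   xmm0, xmm2              ; accumulate positives | negatives
    }
        sub     ecx, 4                  ; four steps done per pass
        jnz     top                     ; repeat until the counter reaches zero

        movhlps xmm1, xmm0              ; high qword of xmm0 -> low qword of xmm1
        addsd   xmm0, xmm1              ; positives + (already negative) sums
        movsd   qword [Pi4Answer], xmm0

        xor     edi, edi                ; exit status 0
        mov     eax, 60                 ; Linux x86-64 exit syscall number
        syscall

segment readable writeable

align 16
first     dq 1.0, 3.0                   ; first two divisors
add4      dq 4.0, 4.0
divover   dq 1.0, -1.0                  ; sign carried in the high lane
Pi4Answer dq 0.0
```

With ecx = 12 this leaves the partial sum of the first 24 series terms (two per divpd step) in Pi4Answer.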


donkey7 23 May 2006, 18:23
if you're going to speed up your program then maybe look at algorithm improvements. Wikipedia lists many better algorithms that compute the value of pi faster. but if you're sticking with this algorithm, you can also do some speedups. we can do some mathematical transformations (example limited to 11, but you can extend it if you want):

Code:
1/1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11
  = (3-1)/(1*3) + (7-5)/(5*7) + (11-9)/(9*11)
  = 2/(1*3) + 2/(5*7) + 2/(9*11)
  -> pi/4 (as more terms are added)

1/(1*3) + 1/(5*7) + 1/(9*11)
  = (5*7*9*11 + 1*3*9*11 + 1*3*5*7) / (1*3*5*7*9*11)
  -> pi/8 (as more terms are added)

as you can see, there is only one division left. it should be a considerable improvement. note: i'm not sure these transformations are right, and you should also use 80-bit precision, because with 64 bits you can quickly get an overflow.
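The middle step of the transformation (two series terms folded into one 2/(d*(d+2)) term) could look roughly like this in scalar SSE2. It is only a sketch: it does one division per pair of terms, not the single division of the common-denominator form, which is left out here because of the overflow concern. The labels divisor, two, four and Pi4Estimate are made up for the example, and the ELF64 wrapper assumes Linux x86-64.

```asm
; Sketch only. Assumptions: Linux x86-64, fasm ELF64 output, SSE2 available.
format ELF64 executable 3
entry start

segment readable executable

start:
        movsd   xmm1, qword [divisor]   ; d = 1.0
        movsd   xmm3, qword [two]       ; constant 2.0
        movsd   xmm5, qword [four]      ; constant 4.0
        xorpd   xmm0, xmm0              ; sum = 0.0
        mov     ecx, 6                  ; pairs to add (6 pairs = 12 series terms)

align 16
top:
        movapd  xmm2, xmm1              ; d
        addsd   xmm2, xmm3              ; d + 2
        mulsd   xmm2, xmm1              ; d * (d + 2)
        movapd  xmm4, xmm3              ; 2.0
        divsd   xmm4, xmm2              ; 2 / (d * (d + 2)), the only division
        addsd   xmm0, xmm4              ; sum += term
        addsd   xmm1, xmm5              ; d += 4: 1, 5, 9, ...
        dec     ecx
        jnz     top

        movsd   qword [Pi4Estimate], xmm0

        xor     edi, edi                ; exit status 0
        mov     eax, 60                 ; Linux x86-64 exit syscall number
        syscall

segment readable writeable

align 8
divisor     dq 1.0
two         dq 2.0
four        dq 4.0
Pi4Estimate dq 0.0
```

Every term is positive here, so there is no sign to track, and the loop needs half as many divisions as the term-by-term version for the same number of series terms.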


Copyright © 1999-2024, Tomasz Grysztar. Also on GitHub, YouTube.
Website powered by rwasa.