flat assembler
Message board for the users of flat assembler.

Index > Heap > Algorithm: FFT (Fast Fourier Transform)
r22

Joined: 27 Dec 2004
Posts: 805
I've optimized an in-place FFT algorithm in Java.
I wanted to get a good algorithm-level optimization before I ported it to ASM and really tried to tune it.

I'm using an immutable Complex number class (obviously using separate arrays of the REAL and IMAGINARY parts would be faster than this abstraction).

Thinking ahead, I'm having trouble coming up with a suitable data layout to take advantage of SIMD.

Any comments / suggestions / criticisms are welcome.

Code:
```
public class FFT {
    public final int SAMPLES;
    public final int RATE;
    private final double SQRT;
    private final int BINLOG;
    private final Complex[][] WK;
    private final int[] REV;

    //initialization
    public FFT(int samples, int rate) {
        //sample count must be a power of 2
        if (samples < 2 || (samples & (samples - 1)) != 0)
            throw new IllegalArgumentException("samples must be a power of 2");
        //init
        SAMPLES = samples;
        RATE = rate;
        SQRT = Math.sqrt((double) SAMPLES);
        //binary log - exact for powers of 2, unlike (int)(Math.log(samples)/Math.log(2))
        BINLOG = Integer.numberOfTrailingZeros(samples);
        //bit reverse order, modified for in-place shuffle
        REV = new int[SAMPLES];
        for (int i = 0; i < SAMPLES; i++) {
            int rev = 0;
            for (int ii = 0; ii < BINLOG; ii++)
                rev |= ((i >> ii) & 1) << (BINLOG - 1 - ii);
            REV[i] = i < rev ? rev : i;
        }
        //precompute the twiddle factors
        WK = new Complex[BINLOG][];
        for (int lvl = 0, bin = 2; lvl < BINLOG; lvl++) {
            WK[lvl] = new Complex[1 << lvl];
            for (int i = 0; i < (1 << lvl); i++) {
                double re =  Math.cos(2. * Math.PI * i / bin);
                double im = -Math.sin(2. * Math.PI * i / bin);
                WK[lvl][i] = new Complex(re, im);
            }
            bin = bin << 1;
        }
    }

    //utility - intensity / magnitude
    public final double intensity(Complex val) {
        return Math.sqrt(val.re() * val.re() + val.im() * val.im()) / SQRT;
    }

    //utility - frequency
    public final double frequency(int index) {
        return (double) (index * RATE) / (double) SAMPLES;
    }

    public final Complex[] fft_opt(Complex[] x) {
        //input length must match the precomputed tables
        if (x.length != SAMPLES)
            throw new IllegalArgumentException("expected " + SAMPLES + " samples");
        //rearrange using reverse bit order
        for (int r = 0; r < REV.length; r = r + 2) {
            Complex tmp1 = x[REV[r]];
            Complex tmp2 = x[REV[r + 1]];
            x[REV[r]] = x[r];
            x[REV[r + 1]] = x[r + 1];
            x[r] = tmp1;
            x[r + 1] = tmp2;
        }
        //compute FFT
        for (int lvl = 0, step = 1; lvl < BINLOG; lvl++) {
            for (int j = 0; j < step; j++) {
                Complex U = WK[lvl][j];
                for (int i = j; i < SAMPLES; i = i + (step << 1)) {
                    //butterfly
                    Complex T = U.times(x[step + i]);
                    x[i + step] = x[i].minus(T);
                    x[i] = x[i].plus(T);
                }
            }
            step = step << 1;
        }
        return x;
    }
}
```

*EDIT* I decided in favor of correctness (so no more truncated shuffle), plus a bug fix to the operand order of the complex subtraction (minus).
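For anyone trying the listing above: Java has no built-in complex type, so here is a minimal immutable sketch of the `Complex` class the code assumes. The method names `re`, `im`, `plus`, `minus`, `times` are taken from the calls in `fft_opt`; everything else is guesswork, not the exact class used:

```java
// Minimal immutable complex number matching the methods used by fft_opt.
// This is a sketch of the assumed interface, not the class from the thread.
public class Complex {
    private final double re, im;

    public Complex(double re, double im) { this.re = re; this.im = im; }

    public double re() { return re; }
    public double im() { return im; }

    // (a+bi) + (c+di) = (a+c) + (b+d)i
    public Complex plus(Complex o)  { return new Complex(re + o.re, im + o.im); }
    // (a+bi) - (c+di) = (a-c) + (b-d)i
    public Complex minus(Complex o) { return new Complex(re - o.re, im - o.im); }
    // (a+bi) * (c+di) = (ac - bd) + (ad + bc)i
    public Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }
}
```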

Last edited by r22 on 28 Oct 2009, 01:45; edited 1 time in total
25 Oct 2009, 19:04
Xorpd!

Joined: 21 Dec 2006
Posts: 161
Are you sure that your algorithm even works? For instance if the high and low 3 bits of an index are all set, then how is it or its bit-reversal to be found among the first (3/4)*(3/4)*SAMPLES indices?
Anyhow, bit reversal with a full-length table scales badly, and radix-2 FFT algorithms use more arithmetic operations than others and also utilize the register file poorly.

A Complex class is much more useful given SIMD registers than separate real and imaginary part arrays. I am not familiar enough with Java to know whether the Complex class has float components or double components.

To use SIMD, just write out a real-half complex FFT algorithm. With one extra post-processing step you can make a full complex FFT out of it. The real-half complex algorithm is applied separately to the real and imaginary parts of the input data set before the post-processing step. When implemented with a Complex class, these separate operations can actually be carried out as 2-way SIMD operations in parallel, thus utilizing these processor capabilities without even breaking a sweat.
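The pairing of two real transforms inside one complex transform can be checked numerically. If z[n] = a[n] + i·b[n] with a, b real, then the spectra of a and b fall out of the single transform Z via A[k] = (Z[k] + conj(Z[N-k]))/2 and B[k] = -i·(Z[k] - conj(Z[N-k]))/2. A sketch using a naive O(N²) DFT for clarity (all names and the interleaved layout are mine, not from the thread):

```java
// Recovering the DFTs of two real signals a and b from ONE complex DFT of
// z = a + i*b. Naive O(N^2) DFT for clarity; layout: x[2t] = Re, x[2t+1] = Im.
public class TwoForOne {
    // Naive DFT on interleaved complex data.
    static double[] dft(double[] x) {
        int n = x.length / 2;
        double[] out = new double[2 * n];
        for (int k = 0; k < n; k++) {
            double sr = 0, si = 0;
            for (int t = 0; t < n; t++) {
                double ang = -2 * Math.PI * k * t / n;
                double c = Math.cos(ang), s = Math.sin(ang);
                sr += x[2 * t] * c - x[2 * t + 1] * s;
                si += x[2 * t] * s + x[2 * t + 1] * c;
            }
            out[2 * k] = sr;
            out[2 * k + 1] = si;
        }
        return out;
    }

    // Max |error| between spectra unpacked from one transform of a+ib
    // and two direct transforms of a and b.
    static double maxUnpackError() {
        double[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] b = {8, 6, 4, 2, 1, 3, 5, 7};
        int n = a.length;
        double[] z = new double[2 * n], za = new double[2 * n], zb = new double[2 * n];
        for (int t = 0; t < n; t++) {
            z[2 * t] = a[t]; z[2 * t + 1] = b[t];   // pack z = a + i*b
            za[2 * t] = a[t]; zb[2 * t] = b[t];     // reference inputs
        }
        double[] Z = dft(z), A = dft(za), B = dft(zb);
        double err = 0;
        for (int k = 0; k < n; k++) {
            int m = (n - k) % n;
            // A[k] = (Z[k] + conj(Z[m]))/2,  B[k] = -i*(Z[k] - conj(Z[m]))/2
            double ar = (Z[2 * k] + Z[2 * m]) / 2;
            double ai = (Z[2 * k + 1] - Z[2 * m + 1]) / 2;
            double br = (Z[2 * k + 1] + Z[2 * m + 1]) / 2;
            double bi = (Z[2 * m] - Z[2 * k]) / 2;
            err = Math.max(err, Math.abs(ar - A[2 * k]));
            err = Math.max(err, Math.abs(ai - A[2 * k + 1]));
            err = Math.max(err, Math.abs(br - B[2 * k]));
            err = Math.max(err, Math.abs(bi - B[2 * k + 1]));
        }
        return err;
    }

    public static void main(String[] args) {
        System.out.println("max unpack error = " + maxUnpackError());
    }
}
```

The unpack step is cheap (one pass over the spectrum), which is why one complex FFT can serve two real data sets essentially for free.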
To get any kind of assembly level optimization that is worthwhile in FFT algorithms, it takes a lot of research just to get up to speed. There is definitely room for significant improvement over the best code available today, but to fill that space will take a lot of your effort. Best of luck.
26 Oct 2009, 06:23
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 17270
Location: In your JS exploiting you and your system
Xorpd! wrote:
There is definitely room for significant improvement over the best code available today, ...
How do you judge how much improvement is possible? IMO, the "best code available today" is already very efficient by most measures. I doubt there is much room to make any significant improvements without changing away from the FFT algorithms.
26 Oct 2009, 07:04
r22
The 3/4th index reverse bit order was something I "re"discovered during my initial algorithm optimization.

The code SWAPs elements to the bit reversed position. If the element has already been SWAPped then it is ignored (well SWAPped with itself). By the time I reach 3/4th of the bit reversed array all elements that need to be SWAPped have been SWAPped.

I'll have to look into calculating the REAL part first and adding the post-processing step. FYI - Java doesn't have a native Complex number class; the one I used was something I found on the web for convenience. Behind the abstraction of the Complex class, I failed to notice the numeric identities of the complex arithmetic.

The other variation of the FFT algo that I was looking at was the recursive (divide and conquer) method. Perhaps an optimized version would come out faster than the radix-2 version.
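For reference, the recursive formulation splits the input into even- and odd-indexed samples, transforms each half, and combines them with twiddle factors. A sketch (interleaved re/im double arrays are my own layout choice; no attempt at optimization):

```java
// Recursive radix-2 Cooley-Tukey FFT, divide-and-conquer form.
// Layout (mine, for the sketch): x[2k] = Re, x[2k+1] = Im; length a power of 2.
public class RecFFT {
    public static double[] fft(double[] x) {
        int n = x.length / 2;
        if (n == 1) return x.clone();
        // Split into even- and odd-indexed samples.
        double[] even = new double[n], odd = new double[n];
        for (int k = 0; k < n / 2; k++) {
            even[2 * k] = x[4 * k];     even[2 * k + 1] = x[4 * k + 1];
            odd [2 * k] = x[4 * k + 2]; odd [2 * k + 1] = x[4 * k + 3];
        }
        double[] e = fft(even), o = fft(odd);
        // Combine: X[k] = E[k] + w^k O[k],  X[k + n/2] = E[k] - w^k O[k].
        double[] out = new double[2 * n];
        for (int k = 0; k < n / 2; k++) {
            double ang = -2 * Math.PI * k / n;
            double wr = Math.cos(ang), wi = Math.sin(ang);
            double tr = wr * o[2 * k] - wi * o[2 * k + 1];
            double ti = wr * o[2 * k + 1] + wi * o[2 * k];
            out[2 * k]               = e[2 * k] + tr;
            out[2 * k + 1]           = e[2 * k + 1] + ti;
            out[2 * (k + n / 2)]     = e[2 * k] - tr;
            out[2 * (k + n / 2) + 1] = e[2 * k + 1] - ti;
        }
        return out;
    }

    public static void main(String[] args) {
        // All-ones input: all energy should land in bin 0.
        double[] X = fft(new double[]{1, 0, 1, 0, 1, 0, 1, 0});
        System.out.println(X[0]);
    }
}
```

The allocations per recursion level are what make this form slower in practice than a good iterative in-place version, even though the arithmetic counts match.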
26 Oct 2009, 13:31
Xorpd!
revolution wrote:
How do you judge how much improvement is possible? IMO, the "best code available today" is already very efficient by most measures. I doubt there is much room to make any significant improvements without changing away from the FFT algorithms.

Your opinion doesn't count for much in the current context. Look at Intel's MKL or IPP websites. They've got some benchmark results there. Compare their quoted throughput with the best-known hard limit of 2 double precision or 4 single precision additions per clock cycle along with the best operation counts for the various sizes of FFT. You will observe that the benchmarks don't come within a factor of 2 of this ideal, unlike what is the case for DGEMM or even SGEMM.
As Bob Dylan advised, "Don't criticize what you don't understand." High-performance assembly level optimization is one of the things, along with physics, that you clearly don't understand.
26 Oct 2009, 16:29
Xorpd!
r22 wrote:
The 3/4th index reverse bit order was something I "re"discovered during my initial algorithm optimization.

Yes, but the 3/4 factor is applied twice for a factor of 9/16. Go back and check your code to verify this.

r22 wrote:
The code SWAPs elements to the bit reversed position. If the element has already been SWAPped then it is ignored (well SWAPped with itself). By the time I reach 3/4th of the bit reversed array all elements that need to be SWAPped have been SWAPped.

You didn't answer my question here. At what index r is 239 = 11101111b swapped with 247 = 11110111b in an FFT on 256 elements in your bit-reversal loop?
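This counterexample can be brute-forced. Assuming the truncated variant simply stopped the swap loop at r < 3·N/4 (my reading of the earlier description; the exact truncation may have differed), this sketch counts the elements of a 256-point shuffle left out of place, and the 239/247 pair is indeed among the missed ones:

```java
// Counts elements left out of bit-reversed position when the swap loop
// is truncated at 3/4 of the array (N = 256, BINLOG = 8).
public class TruncShuffle {
    public static int misplaced() {
        final int N = 256, LOG = 8;
        // Same REV table as the code at the top of the thread: max(i, bitreverse(i)).
        int[] rev = new int[N];
        for (int i = 0; i < N; i++) {
            int r = 0;
            for (int b = 0; b < LOG; b++) r |= ((i >> b) & 1) << (LOG - 1 - b);
            rev[i] = i < r ? r : i;
        }
        // Shuffle, but stop at 3/4 of the array (the truncated variant).
        int[] x = new int[N];
        for (int i = 0; i < N; i++) x[i] = i;
        for (int r = 0; r < 3 * N / 4; r += 2) {
            int t1 = x[rev[r]], t2 = x[rev[r + 1]];
            x[rev[r]] = x[r];  x[rev[r + 1]] = x[r + 1];
            x[r] = t1;         x[r + 1] = t2;
        }
        // After a correct shuffle, position i holds the value bitreverse(i).
        int bad = 0;
        for (int i = 0; i < N; i++) {
            int r = 0;
            for (int b = 0; b < LOG; b++) r |= ((i >> b) & 1) << (LOG - 1 - b);
            if (x[i] != r) bad++;
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(misplaced() + " of 256 elements misplaced");
    }
}
```

By this count 12 of 256 elements (six pairs, all with both indices above 192, including 239/247) never get swapped.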

r22 wrote:
I'll have to look into calculating the REAL part first and adding the post processing step. FYI - Java doesn't have a native Complex number class the one I used was something I found on the web for convenience. Using the abstraction of the Complex class I failed to notice the numeric identities of the Complex arithmetic.

Again you didn't answer my question: are the elements of your Complex class single or double precision? This makes a difference because SSE[2] can operate on 2 double precision numbers simultaneously but can generate 4 single precision results in a single operation. Thus, the assembly realization of your code would be completely different for the two cases.

r22 wrote:
The other variation of the FFT algo that I was looking at was the recursive (divide and conquer) method. Perhaps an optimized version would come out faster than the radix-2 version.

Recursive and radix-2 are not mutually exclusive. As I have said, this will take a lot of your time (months is conservative) to do a reasonable job.
26 Oct 2009, 16:41
tom tobias

Joined: 09 Sep 2003
Posts: 1320
Location: usa
Xorpd! wrote:

Look at Intel's MKL or IPP websites. They've got some benchmark results there. Compare their quoted throughput with with the best known hard limit of 2 double precision or 4 single precision additions per clock cycle along with the best operation counts for the various sizes of FFT.

Bullshit.

here's from your beloved Intel mkl site:

http://software.intel.com/sites/products/collateral/hpc/mkl/mkl_indepth.pdf
Quote:
• Improved single threaded performance of up to 1.8 times on
complex 1D FFTs for power-of-two sizes
• On Intel® 64 architecture-based systems running in 64-bit mode,
single precision complex backward 1D FFTs for data sizes greater
than 2^22 elements have been sped up by up to 2 times on 4
threads and up to 2.4 times on 8 threads on Intel Itanium processors.

that's just more marketing hype.
Show me the data.
Intel has always, for thirty years, been 99% marketing and 1% engineering. This web site is just more of the same old crap.

I want numbers. DATA. not "1.8 times faster".

How much time is required by this marvelous 64 bit Intel architecture, to compute a 256 point, 32 bit, single precision FFT using a split radix?
I ran mine on a 486/66, both integer and single precision floating point, roughly the same execution time, about 1.0 to 1.1 milliseconds, i.e. approximately 4% of the frame time.

So, is there a significant improvement on these times with Intel's mkl crap? I doubt it. But, please, go ahead, PROVE ME WRONG. I don't mind at all.

What advantage does double precision, 64 bit computation afford, in the realm for which 256 point sampling is useful, i.e. voice recognition? Does double precision afford greater resolution on mathematical products, given that the initial analog to digital conversion is based on 16 bits?

My personal opinion, which may be completely wrong, is that if I were to simply run my code on a modern cpu, i.e. "686" @ 3GHz, (fifty times faster than my old 486) I would expect an approximately 100 fold improvement on my times, due solely to the architectural improvements, having absolutely nothing whatsoever to do with Intel's MKL library.

In other words, I would expect to measure times somewhere between 0.1 and 0.01 milliseconds for the 256 point 32 bit Duhamel split radix FFT. How much time does the MKL library require for the same 256 point FFT, with sampling at 16 kHz? It would not surprise me if the Intel MKL library required half a second to compute the same data.
26 Oct 2009, 18:47
Borsuc

Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
tom, what world do you live in? First of all, FFT is used in almost ALL AUDIO ENGINEERING APPLICATIONS. Even if the input is 16-bit, an FFT is a transform, and ANY transform is best worked at high resolution, for the same reason you don't work with .mp3 files: you work with a high-precision uncompressed format, and AT THE END downsample it to 16-bit.

Errors accumulate, this is a basic principle, with ANY transform or sound manipulation.

FFT time is also critical when doing very-overlapped Short-Time Fourier Transforms as they can take a lot of time for long signals. (overlapped means much better quality in some cases). FFTs are used everywhere from EQ, to saturation effects, to "harmonic enhancing" filters, to anything dealing with the spectral domain.

Point of the matter is, if you are dealing with audio editing or composition and mixing and applying many transformations, you should work with 32-bit at least. 64-bit can be seen as "no compromises". Afterwards downsample to 16-bit and even convert to mp3 if you want, but not while you are transforming for goodness' sake!
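A quick illustration of why intermediate 16-bit rounding is dangerous: attenuate a signal heavily, then boost it back. Kept in doubles, the round trip is harmless; quantized to 16 bits in between, the signal is destroyed. The numbers here are mine, chosen to make the loss obvious:

```java
// Attenuate-then-boost round trip: double precision vs. a 16-bit intermediate.
public class Quantize {
    // Round to a 16-bit sample grid (full scale = 32767).
    static double q16(double v) { return Math.round(v * 32767.0) / 32767.0; }

    public static void main(String[] args) {
        double x = 0.5;       // original sample
        double gain = 1e-6;   // heavy attenuation, later undone

        double viaDouble = (x * gain) / gain;     // survives: 0.5
        double via16bit  = q16(x * gain) / gain;  // 0.5e-6 rounds to 0, gone

        System.out.println("double path: " + viaDouble);  // 0.5
        System.out.println("16-bit path: " + via16bit);   // 0.0
    }
}
```

Real processing chains don't use gains this extreme, but the same rounding loss happens in miniature at every intermediate step, which is exactly the accumulation argument above.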

Also there are soundcards with 24-bit ADCs...

no offense but some of your "computational examples" are way too simplistic... there are very demanding apps out there

_________________
Previously known as The_Grey_Beast
26 Oct 2009, 19:58
r22
@xorpd
- Using 3/4ths of the reversing does in fact cause some elements to be out of place. Roughly 3% of the elements. But I figured 3% wasn't THAT much, so the end result would be close enough. Saying "*all* the elements that needed to be swapped were swapped" was a gross exaggeration. In any case, I'll have to see how the ~3% of un-shuffled elements affects the final intensity values.
- The Complex number class is using DOUBLE precision, but I'm considering having both a DOUBLE and SINGLE precision ASM version. This may depend on if I accept limited accuracy.
- The project is only for my amusement so time really isn't an issue.

How accurate should an FFT algorithm be? Would cutting a few corners to get the job done faster be appropriate for certain use cases?
26 Oct 2009, 20:14
Borsuc
r22 wrote:
How accurate should an FFT algorithm be? Would cutting a few corners to get the job done faster be appropriate for certain use cases?
IMO depends on the application, or the app can also offer an option. For real-time entertainment applications I think being faster is ok, but for something like offline rendering, you should be 100% for quality, at least IMO.

Why use floats though?

26 Oct 2009, 22:52
revolution
Xorpd! wrote:
Your opinion doesn't count for much in the current context. Look at Intel's MKL or IPP websites. They've got some benchmark results there. Compare their quoted throughput with with the best known hard limit of 2 double precision or 4 single precision additions per clock cycle along with the best operation counts for the various sizes of FFT. You will observe that the benchmarks don't come within a factor of 2 of this ideal, unlike what is the case for DGEMM or even SGEMM.
The problem here is that the Intel code is crap for performance. There is a lot of code available that is much better than what Intel provide. A good example is the Prime95 FFT code.
Xorpd! wrote:
As Bob Dylan advised, "Don't criticize what you don't understand." High-performance assembly level optimization is one of the things, along with physics, that you clearly don't understand.
Hmm.
27 Oct 2009, 01:01
revolution
Borsuc wrote:
Why use floats though?
Are you suggesting using NTT? That has its own problems. If there were a special math unit built into the CPU for modulo reduction then perhaps NTT could become useful and compete with FFT for speed.
27 Oct 2009, 01:03
Xorpd!
tom tobias wrote:
In other words, I would expect to measure times somewhere between 0.1 and 0.01 milliseconds for the 256 point 32 bit Duhamel split radix FFT. How much time does the MKL library require for the same 256 point FFT, with sampling at 16 kHz? It would not surprise me if the Intel MKL library required half a second to compute the same data.

Frigo & Johnson quote 1.4 μs double precision and 0.8 μs single precision for MKL on 3.0 GHz Core 2 Duo. I think MKL uses radix-4 rather than split radix. The latter algorithm is funky enough that it's awkward to try to fit into SSE2. Looking at some of these timing results it seems that the competitors have had time to get some of their SSE2 mojo working since the last time I looked. Can't find any timing results for Prime95, though.
27 Oct 2009, 06:25
Borsuc
@revolution: what's NNT?
I was talking about using integers/fixed-point, since PCM data is stored that way anyway (not floats), so less conversion, and the values are bounded/normalized anyway (no huge exponents).

27 Oct 2009, 16:14
revolution
Borsuc: The dynamics of FFT don't scale well for fixed point or integers. You would start to get overflow and rounding errors very quickly. If you look at the numbers produced by an FFT they can have a large dynamic range depending upon which transform you are doing.
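The dynamic-range point is easy to see on the DC bin alone: for an N-point transform of a full-scale constant signal, bin 0 sums N copies of the input, so intermediate values need about log2(N) more integer bits than the samples. A sketch (N = 4096 is my arbitrary choice):

```java
// Dynamic range of FFT output: the DC bin of a constant full-scale
// 16-bit signal grows by a factor of N and overflows 16-bit storage.
public class DynRange {
    public static void main(String[] args) {
        int n = 4096;
        short fullScale = 32767;
        long dc = 0;                          // bin 0 = plain sum of the samples
        for (int t = 0; t < n; t++) dc += fullScale;
        System.out.println("bin 0 = " + dc);  // 134213632, i.e. 4096x full scale
        System.out.println("bits needed = " + (64 - Long.numberOfLeadingZeros(dc)));
    }
}
```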

Besides, why do you think integers would be faster? The x86 CPUs have very fast float operations and they can be parallelised with SSE.

And I suggest you search for NTT (not NNT) if you really want to know about it. Pure integer transforms with no need for error detection and guaranteed ranges for results means that integers can be used for all calculations. But ... as usual there is no free lunch, NTT requires an expensive modulo reduction that doesn't really fit well with the current CPU ALU/FPU architecture.
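For the curious, the exactness can be demonstrated with a toy NTT: all arithmetic is done modulo a prime p = c·2^k + 1 with a known primitive root (p = 998244353 with root 3 is the textbook choice; neither appears in this thread), and forward-then-inverse reproduces the input bit-for-bit. The modulo reduction after every multiply is exactly the cost described above. A naive O(N²) sketch:

```java
import java.util.Arrays;

// Toy number-theoretic transform mod p = 998244353 (= 119*2^23 + 1, root 3).
// Naive O(N^2) version: the point is exact integer round-tripping, not speed.
public class ToyNTT {
    static final long P = 998244353L, ROOT = 3L;

    static long powMod(long b, long e) {
        long r = 1; b %= P;
        for (; e > 0; e >>= 1, b = b * b % P)
            if ((e & 1) == 1) r = r * b % P;
        return r;
    }

    // Naive transform with the given N-th root of unity w.
    static long[] transform(long[] a, long w) {
        int n = a.length;
        long[] out = new long[n];
        for (int k = 0; k < n; k++) {
            long s = 0, wk = powMod(w, k), wkt = 1;
            for (int t = 0; t < n; t++) {
                s = (s + a[t] * wkt) % P;  // modulo reduction after each multiply
                wkt = wkt * wk % P;
            }
            out[k] = s;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        int n = x.length;
        long w = powMod(ROOT, (P - 1) / n);        // primitive N-th root of unity
        long[] X = transform(x, w);
        long[] y = transform(X, powMod(w, P - 2)); // inverse uses w^-1 (Fermat)
        long nInv = powMod(n, P - 2);
        for (int i = 0; i < n; i++) y[i] = y[i] * nInv % P;
        System.out.println(Arrays.toString(y));    // exactly [1, 2, ..., 8]
    }
}
```

No rounding, no error analysis; the price is the `% P` after every product, which is the modulo reduction that current ALUs have no cheap instruction for.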
27 Oct 2009, 16:31
Borsuc
I didn't know it could generate huge numbers, but with integers you would need less space (or get more precision, if the exponent is worthless). Less space means more parallelism, and more precision means better quality.

or so I thought anyway. Isn't there SSE for integers too (SSE2 IIRC)?

27 Oct 2009, 17:07
revolution
Borsuc: Yes, SSE2 supports integer operations. But there is no easy way to detect overflows and using fixed point requires an adjustment operation after each mul. It doesn't really fit well for FFT to use fixed point with integers.
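The adjustment in question is the post-multiply shift of fixed-point arithmetic. In Q15 (value = integer / 2^15, the usual 16-bit audio format; my example values), the product of two Q15 numbers is Q30 and must be shifted back down after every multiply, which is exactly the extra work a fixed-point FFT pays on each butterfly:

```java
// Q15 fixed point: value = n / 32768, stored in 16 bits.
// A Q15 * Q15 product is Q30 and needs a >>15 adjustment after every multiply.
public class Q15 {
    static short mul(short a, short b) {
        int p = a * b;                           // Q30 intermediate, needs 32 bits
        return (short) ((p + (1 << 14)) >> 15);  // round and renormalize to Q15
    }

    public static void main(String[] args) {
        short half = 1 << 14;                    // 0.5 in Q15
        short quarter = mul(half, half);
        System.out.println(quarter / 32768.0);   // 0.25
    }
}
```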
27 Oct 2009, 17:19
r22
The operations required:
Complex ADD, SUB, MUL
Code:
```Complex ADD:
r = Real1 + Real2
i = Imag1 + Imag2

Complex SUB:
r = Real1 - Real2
i = Imag1 - Imag2

Complex MUL:
r = (Real1 * Real2) - (Imag1 * Imag2)
i = (Real1 * Imag2) + (Imag1 * Real2)
```

The main calculation in the FFT
Code:
```R = array of real fft elements (R[x], I[x] would be one complex)
I = array of imaginary fft elements
P = array of the precomputed reals (WK in code at top of thread)
Q = array of the precomputed imaginaries
A,B,C = indexes into the arrays
Steps:
1: tmpr = P[C] * R[B] - Q[C] * I[B]
2: tmpi = P[C] * I[B] + Q[C] * R[B]
3: R[B] = R[A] - tmpr
4: I[B] = I[A] - tmpi
5: R[A] = R[A] + tmpr
6: I[A] = I[A] + tmpi
```

Assuming DOUBLE precision for now and INTERLEAVED elements (Array[0] = real0, Array[1] = imag0, etc.)

Steps 1,2 (the complex MUL)
Code:
```
MOVDQA XMM0, Precomputed[C] ;;XMM0 = Q[C] | P[C]
MOVDQA XMM1, Elements[B] ;;XMM1 = I[B] | R[B]
PSHUFD XMM2, XMM0, 11101110b ;;XMM2 = Q[C] | Q[C]
PSHUFD XMM3, XMM1, 01001110b ;;XMM3 = R[B] | I[B]
UNPCKLPD XMM0, XMM0 ;;XMM0 = P[C] | P[C]
MULPD XMM2, XMM3 ;;XMM2 = Q[C] * R[B] | Q[C] * I[B]
MULPD XMM0, XMM1 ;;XMM0 = P[C] * I[B] | P[C] * R[B]
;;flip the sign bit of the low QWORD/DOUBLE of XMM2
XORPD XMM2, [ToggleLowSign] ;; dq 8000000000000000h, 0
ADDPD XMM0, XMM2 ;;combine the two halves
;;XMM0 = P[C] * I[B] + Q[C] * R[B] | P[C] * R[B] - Q[C] * I[B]
```

Steps 3,4,5 and 6
Code:
```;;XMM0 = tmpi | tmpr
MOVDQA XMM1, Elements[A] ;;XMM1 = I[A] | R[A]
MOVDQA XMM2, XMM1 ;;copy
ADDPD XMM1, XMM0 ;;Complex ADD XMM1 = I[A] + tmpi | R[A] + tmpr
SUBPD XMM2, XMM0 ;;Complex SUB XMM2 = I[A] - tmpi | R[A] - tmpr
MOVDQA Elements[A], XMM1 ;;store
MOVDQA Elements[B], XMM2
```

Making use of all the XMMx registers and interleaving the Load and Copy from Steps (3,4,5,6) to avoid the stalls could create relatively fast code.

Fascinating...
*EDIT* Complex SUBtraction was backwards
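In scalar terms, the lane arithmetic the sequence above is aiming for can be cross-checked against the complex MUL formulas at the top of the post (the concrete values are arbitrary; lanes are written as {low, high} = {re, im}):

```java
// Scalar mirror of the SIMD complex multiply: each pair of doubles stands
// for one XMM register, {low lane, high lane} = {real lane, imaginary lane}.
public class LaneCheck {
    public static void main(String[] args) {
        double P = 0.6, Q = -0.8;   // twiddle W = P + Qi
        double R = 1.5, I = -2.5;   // element x = R + Ii

        // After duplicating P and multiplying: { P*R, P*I }
        double lo0 = P * R, hi0 = P * I;
        // After duplicating Q, swapping lanes of x, multiplying: { Q*I, Q*R }
        double lo2 = Q * I, hi2 = Q * R;
        // Flip the sign of the low lane (toggling bit 63 of the low qword)
        lo2 = -lo2;
        // Final packed add: { P*R - Q*I, P*I + Q*R }
        double re = lo0 + lo2, im = hi0 + hi2;

        // This matches the complex product (P + Qi)(R + Ii)
        double expRe = P * R - Q * I, expIm = P * I + Q * R;
        System.out.println((re == expRe && im == expIm) ? "lanes match" : "mismatch");
    }
}
```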
27 Oct 2009, 20:14
tom tobias
Xorpd! wrote:

Frigo & Johnson quote 1.4 μs double precision and 0.8 μs single precision for MKL on 3.0 GHz Core 2 Duo. I think MKL uses radix-4 rather than split radix. The latter algorithm is funky enough that it's awkward to try to fit into SSE2.

First, THANK you, very much for this reference to the MIT lab.

Secondly, Perhaps I will be proven wrong, in time I will attempt to run my code on their testbench. However:

Thirdly: their implementation, according to their home page, is based on a C program, whereas mine is 100% ASM (TASM, DOS). Admittedly, ASM versus C doesn't always correlate with faster execution by definition.

Fourth, (not forth,) is this time which you quoted, of 1.4 microseconds, the time needed to perform a 256 point FFT using 32 bit integer instructions with 16 bit data? I cannot envision my code executing faster than 10 microseconds on a modern cpu, if so, then their code would be at least seven times faster than mine.

Fifth, as revolution has explained (patiently) many times, to me, and to others, execution speed will also depend upon data location in cache versus main memory. Since all of my voice data is coming into the computer from a 16 bit sound card, via DMA, it is transferred into RAM, not cache, whereas, (and I don't know this for a fact) Frigo and Johnson, may, or may not, have employed cache to store the data upon which to operate their FFT. I don't know (i.e. I don't remember) how much cache is on the 486 which I used for my testing, but certainly, when I measured execution times, it was based on new data, not old data sitting in cache.

They apparently employ SSE instructions, while I use plain vanilla 32-bit integer instructions (floating point is slightly slower).

So, my code may indeed be much slower than theirs, I agree, it is a reasonable possibility. I am keeping my money on Red, however, until after the test is run, if it is even possible for me to run my slow old code on their testing apparatus.....

I will post back once I find time to do this chore....
28 Oct 2009, 12:54
tom tobias
Xorpd! wrote:

To get any kind of assembly level optimization that is worthwhile in FFT algorithms, it takes a lot of research just to get up to speed. There is definitely room for significant improvement over the best code available today, but to fill that space will take a lot of your effort.

It took me ten years, 1984 to 1993.

28 Oct 2009, 13:07