flat assembler
Message board for the users of flat assembler.
Heap > Algorithm: FFT (Fast Fourier Transform)

Xorpd!
Are you sure that your algorithm even works? For instance, if the high and low 3 bits of an index are all set, how is it or its bit reversal to be found among the first (3/4)*(3/4)*SAMPLES indices?

Anyhow, bit reversal with a full-length table scales badly, and radix-2 FFT algorithms use more arithmetic operations than others and also utilize the register file poorly. A Complex class is much more useful given SIMD registers than separate real and imaginary part arrays. I am not familiar enough with Java to know whether the Complex class has float components or double components.

To use SIMD, just write out a real/half-complex FFT algorithm. With one extra post-processing step you can make a full complex FFT out of it: the half-complex algorithm is applied separately to the real and imaginary parts of the input data before the post-processing step. When implemented over a Complex class, these separate operations can actually be carried out as 2-way SIMD operations in parallel, thus utilizing these processor capabilities without even breaking a sweat.

To get any kind of assembly-level optimization that is worthwhile in FFT algorithms, it takes a lot of research just to get up to speed. There is definitely room for significant improvement over the best code available today, but to fill that space will take a lot of your effort. Best of luck.

26 Oct 2009, 06:23 

revolution
Xorpd! wrote: There is definitely room for significant improvement over the best code available today, ...

How do you judge how much improvement is possible? IMO, the "best code available today" is already very efficient by most measures. I doubt there is much room to make any significant improvements without changing away from the FFT algorithms.

26 Oct 2009, 07:04 

r22
The 3/4ths index bit-reversal was something I "re"discovered during my initial algorithm optimization.

The code SWAPs elements to the bit-reversed position. If the element has already been SWAPped then it is ignored (well, SWAPped with itself). By the time I reach 3/4ths of the bit-reversed array, all elements that need to be SWAPped have been SWAPped.

I'll have to look into calculating the REAL part first and adding the post-processing step. FYI: Java doesn't have a native Complex number class; the one I used was something I found on the web for convenience. Using the abstraction of the Complex class, I failed to notice the numeric identities of the Complex arithmetic.

The other variation of the FFT algo that I was looking at was the recursive (divide and conquer) method. Perhaps an optimized version would come out faster than the radix-2 version.

26 Oct 2009, 13:31 

Xorpd!
revolution wrote: How do you judge how much improvement is possible? IMO, the "best code available today" is already very efficient by most measures. I doubt there is much room to make any significant improvements without changing away from the FFT algorithms.

Your opinion doesn't count for much in the current context. Look at Intel's MKL or IPP websites. They've got some benchmark results there. Compare their quoted throughput with the best known hard limit of 2 double-precision or 4 single-precision additions per clock cycle, along with the best operation counts for the various sizes of FFT. You will observe that the benchmarks don't come within a factor of 2 of this ideal, unlike what is the case for DGEMM or even SGEMM.

As Bob Dylan advised, "Don't criticize what you don't understand." High-performance assembly-level optimization is one of the things, along with physics, that you clearly don't understand.

26 Oct 2009, 16:29 

Xorpd!
r22 wrote: The 3/4th index reverse bit order was something I "re"discovered during my initial algorithm optimization.

Yes, but the 3/4 factor is applied twice, for a factor of 9/16. Go back and check your code to verify this.

r22 wrote: The code SWAPs elements to the bit reversed position. If the element has already been SWAPped then it is ignored (well SWAPped with itself). By the time I reach 3/4th of the bit reversed array all elements that need to be SWAPped have been SWAPped.

You didn't answer my question here. At what loop index is 239 = 11101111b swapped with 247 = 11110111b in an FFT on 256 elements in your bit-reversal loop?

r22 wrote: I'll have to look into calculating the REAL part first and adding the post processing step. FYI Java doesn't have a native Complex number class; the one I used was something I found on the web for convenience. Using the abstraction of the Complex class I failed to notice the numeric identities of the Complex arithmetic.

Again you didn't answer my question: are the elements of your Complex class single or double precision? This makes a difference because SSE[2] can operate on 2 double-precision numbers simultaneously but can generate 4 single-precision results in a single operation. Thus, the assembly realization of your code would be completely different for the two cases.

r22 wrote: The other variation of the FFT algo that I was looking at was the recursive (divide and conquer) method. Perhaps an optimized version would come out faster than the radix2 version.

Recursive and radix-2 are not mutually exclusive. As I have said, this will take a lot of your time (months is conservative) to do a reasonable job.

26 Oct 2009, 16:41 

tom tobias
Xorpd! wrote:
Bullshit. Here's from your beloved Intel MKL site: http://software.intel.com/sites/products/collateral/hpc/mkl/mkl_indepth.pdf

Quote: • Improved single threaded performance of up to 1.8 times on

That's just more marketing hype. Show me the data. Intel has always, for thirty years, been 99% marketing and 1% engineering. This web site is just more of the same old crap. I want numbers. DATA. Not "1.8 times faster". How much time is required by this marvelous 64-bit Intel architecture to compute a 256-point, 32-bit, single-precision FFT using a split radix? I ran mine on a 486/66, both integer and single-precision floating point, with roughly the same execution time: about 1.0 to 1.1 milliseconds, i.e. approximately 4% of the frame time. So, is there a significant improvement on these times with Intel's MKL crap? I doubt it. But, please, go ahead, PROVE ME WRONG. I don't mind at all.

What advantage does double-precision, 64-bit computation afford in the realm for which 256-point sampling is useful, i.e. voice recognition? Does double precision afford greater resolution on mathematical products, given that the initial analog-to-digital conversion is based on 16 bits?

My personal opinion, which may be completely wrong, is that if I were to simply run my code on a modern CPU, i.e. a "686" @ 3 GHz (fifty times faster than my old 486), I would expect an approximately 100-fold improvement on my times, due solely to the architectural improvements, having absolutely nothing whatsoever to do with Intel's MKL library. In other words, I would expect to measure times somewhere between 0.1 and 0.01 milliseconds for the 256-point, 32-bit Duhamel split-radix FFT. How much time does the MKL library require for the same 256-point FFT, with sampling at 16 kHz? It would not surprise me if the Intel MKL library required half a second to compute the same data.

26 Oct 2009, 18:47 

Borsuc
tom, what world do you live in? First of all, the FFT is used in almost ALL AUDIO ENGINEERING APPLICATIONS. Even if the input is 16-bit, an FFT is a transform, and ANY transform is best worked at a high resolution, for the same reason you don't work with .mp3 files: you work with a high-precision uncompressed format and AT THE END downsample to 16-bit.
Errors accumulate; this is a basic principle with ANY transform or sound manipulation. FFT time is also critical when doing very-overlapped Short-Time Fourier Transforms, as they can take a lot of time for long signals (overlapping means much better quality in some cases). FFTs are used everywhere: from EQ, to saturation effects, to "harmonic enhancing" filters, to anything dealing with the spectral domain.

Point of the matter is, if you are dealing with audio editing or composition and mixing and applying many transformations, you should work with 32-bit at least. 64-bit can be seen as "no compromises". Afterwards downsample to 16-bit and even convert to mp3 if you want, but not while you are transforming, for goodness' sake! Also, there are sound cards with 24-bit ADCs... No offense, but some of your "computational examples" are way too simplistic... there are very demanding apps out there.
_________________
Previously known as The_Grey_Beast

26 Oct 2009, 19:58 

r22
@xorpd
@xorpd
- Using 3/4ths of the reversing does in fact cause some elements to be out of place, roughly 3% of the elements. But I figured 3% wasn't THAT much, so the end result would be close enough. Saying "*all* the elements that needed to be swapped were swapped" was a gross exaggeration. In any case, I'll have to see how the ~3% unshuffled elements affect the final intensity values.
- The Complex number class is using DOUBLE precision, but I'm considering having both a DOUBLE and a SINGLE precision ASM version. This may depend on whether I accept limited accuracy.
- The project is only for my amusement, so time really isn't an issue.

How accurate should an FFT algorithm be? Would cutting a few corners to get the job done faster be appropriate for certain use cases?

26 Oct 2009, 20:14 

Borsuc
r22 wrote: How accurate should an FFT algorithm be? Would cutting a few corners to get the job done faster be appropriate for certain use cases?

Why use floats though?

26 Oct 2009, 22:52 

revolution
Xorpd! wrote: Your opinion doesn't count for much in the current context. Look at Intel's MKL or IPP websites. They've got some benchmark results there. Compare their quoted throughput with the best known hard limit of 2 double precision or 4 single precision additions per clock cycle along with the best operation counts for the various sizes of FFT. You will observe that the benchmarks don't come within a factor of 2 of this ideal, unlike what is the case for DGEMM or even SGEMM.

Xorpd! wrote: As Bob Dylan advised, "Don't criticize what you don't understand." High-performance assembly level optimization is one of the things, along with physics, that you clearly don't understand.

27 Oct 2009, 01:01 

revolution
Borsuc wrote: Why use floats though? 

27 Oct 2009, 01:03 

Xorpd!
tom tobias wrote: In other words, I would expect to measure times somewhere between 0.1 and 0.01 milliseconds for the 256 point 32 bit Duhamel split radix FFT. How much time does the MKL library require for the same 256 point FFT, with sampling at 16 kHz? It would not surprise me if the Intel MKL library required half a second to compute the same data.

Frigo & Johnson quote 1.4 μs double precision and 0.8 μs single precision for MKL on a 3.0 GHz Core 2 Duo. I think MKL uses radix-4 rather than split radix. The latter algorithm is funky enough that it's awkward to try to fit into SSE2. Looking at some of these timing results, it seems that the competitors have had time to get some of their SSE2 mojo working since the last time I looked. Can't find any timing results for Prime95, though.

27 Oct 2009, 06:25 

Borsuc
@revolution: what's NNT?
I was talking about using integers/fixed-point, since PCM data is that way anyway (not floats), so there's less conversion, and the values are normalized and bounded anyway (no huge exponents).

27 Oct 2009, 16:14 

revolution
Borsuc: The dynamics of the FFT don't scale well for fixed point or integers. You would start to get overflow and rounding errors very quickly. If you look at the numbers produced by an FFT, they can have a large dynamic range depending upon which transform you are doing.
Besides, why do you think integers would be faster? The x86 CPUs have very fast float operations, and they can be parallelised with SSE. And I suggest you search for NTT (not NNT) if you really want to know about it. A pure integer transform with no need for error detection and guaranteed ranges for results means that integers can be used for all calculations. But... as usual there is no free lunch: the NTT requires an expensive modulo reduction that doesn't really fit well with the current CPU ALU/FPU architecture.

27 Oct 2009, 16:31 

Borsuc
I didn't know it could generate huge numbers, but with integers you would need less space (or get more precision, if the exponent is worthless). Less space means more parallelism, and more precision means better results.
Or so I thought, anyway. Isn't there SSE for integers too (SSE2, IIRC)?

27 Oct 2009, 17:07 

revolution
Borsuc: Yes, SSE2 supports integer operations. But there is no easy way to detect overflows, and using fixed point requires an adjustment operation after each mul. Fixed-point integer arithmetic doesn't really fit the FFT well.


27 Oct 2009, 17:19 

r22
The operations required: Complex ADD, SUB, MUL

Code:
Complex ADD: r = Real1 + Real2
             i = Imag1 + Imag2
Complex SUB: r = Real1 - Real2
             i = Imag1 - Imag2
Complex MUL: r = (Real1 * Real2) - (Imag1 * Imag2)
             i = (Real1 * Imag2) + (Imag1 * Real2)

The main calculation in the FFT

Code:
R = array of real fft elements (R[x],I[x] would be a complex)
I = array of imaginary fft elements
P = array of the precomputed reals (WK in code at top of thread)
Q = array of the precomputed imaginaries
A,B,C = indexes into the arrays
Steps:
1: tmpr = P[C] * R[B] - Q[C] * I[B]
2: tmpi = P[C] * I[B] + Q[C] * R[B]
3: R[B] = R[A] - tmpr
4: I[B] = I[A] - tmpi
5: R[A] = R[A] + tmpr
6: I[A] = I[A] + tmpi

Assuming DOUBLE precision for now and INTERLACED elements (Array[0] = real0, Array[1] = imag0, etc.)

Steps 1,2 (the complex MUL)

Code:
MOVAPD   XMM0, Precomputed[C]   ;; XMM0 = Q[C] | P[C]
MOVAPD   XMM1, Elements[B]      ;; XMM1 = I[B] | R[B]
PSHUFD   XMM2, XMM0, 11101110b  ;; XMM2 = Q[C] | Q[C]
PSHUFD   XMM3, XMM1, 01001110b  ;; XMM3 = R[B] | I[B]
UNPCKLPD XMM0, XMM0             ;; XMM0 = P[C] | P[C]
MULPD    XMM2, XMM3             ;; XMM2 = Q[C]*R[B] | Q[C]*I[B]
MULPD    XMM0, XMM1             ;; XMM0 = P[C]*I[B] | P[C]*R[B]
;; flip the sign bit of the low QWORD/DOUBLE of XMM2
XORPD    XMM2, [ToggleLowSign]  ;; ToggleLowSign dq 8000000000000000h, 0
ADDPD    XMM0, XMM2             ;; XMM0 = P[C]*I[B] + Q[C]*R[B] | P[C]*R[B] - Q[C]*I[B]

Steps 3,4,5 and 6

Code:
;; XMM0 = tmpi | tmpr
MOVAPD   XMM1, Elements[A]      ;; XMM1 = I[A] | R[A]
MOVAPD   XMM2, XMM1             ;; copy
ADDPD    XMM1, XMM0             ;; Complex ADD: XMM1 = I[A]+tmpi | R[A]+tmpr
SUBPD    XMM2, XMM0             ;; Complex SUB: XMM2 = I[A]-tmpi | R[A]-tmpr
MOVAPD   Elements[A], XMM1      ;; store
MOVAPD   Elements[B], XMM2

Making use of all the XMMx registers and interleaving the load and copy from steps 3-6 to avoid the stalls could create relatively fast code. Fascinating...

*EDIT* Complex SUBtraction was backwards

27 Oct 2009, 20:14 

tom tobias
Xorpd! wrote:
First: THANK you, very much, for this reference to the MIT lab. Secondly, perhaps I will be proven wrong; in time I will attempt to run my code on their testbench. However:

Thirdly: their implementation, according to their home page, is based on a C program, whereas mine is 100% ASM (TASM, DOS). Admittedly, ASM versus C doesn't always correlate with faster execution by definition.

Fourth (not forth): is this time which you quoted, of 1.4 microseconds, the time needed to perform a 256-point FFT using 32-bit integer instructions with 16-bit data? I cannot envision my code executing faster than 10 microseconds on a modern CPU; if so, then their code would be at least seven times faster than mine.

Fifth, as revolution has explained (patiently) many times, to me and to others, execution speed will also depend upon data location in cache versus main memory. Since all of my voice data is coming into the computer from a 16-bit sound card, via DMA, it is transferred into RAM, not cache, whereas (and I don't know this for a fact) Frigo and Johnson may, or may not, have employed cache to store the data upon which to operate their FFT. I don't know (i.e. I don't remember) how much cache is on the 486 which I used for my testing, but certainly, when I measured execution times, it was based on new data, not old data sitting in cache. They apparently employ SSE instructions, while I use plain vanilla 32-bit integer instructions (floating point is slightly slower).

So, my code may indeed be much slower than theirs; I agree it is a reasonable possibility. I am keeping my money on Red, however, until after the test is run, if it is even possible for me to run my slow old code on their testing apparatus... I will post back once I find time to do this chore...

28 Oct 2009, 12:54 

tom tobias
Xorpd! wrote:
It took me ten years, 1984 to 1993. 

28 Oct 2009, 13:07 


Copyright © 1999-2020, Tomasz Grysztar.
Powered by rwasa.