flat assembler
Message board for the users of flat assembler.

Index > Heap > Algorithm: FFT (Fast Fourier Transform)

Xorpd!
Joined: 21 Dec 2006
Posts: 161
tom tobias wrote:
H. V. Sorensen, M. T. Heideman, and C. S. Burrus, On computing the split-radix FFT, IEEE Transactions on Acoustics, Speech and Signal Processing, 34 (1986), pp. 152–156.

Smile


For a second there I thought you were going to reference Sorensen, Jones, Heideman, & Burrus, Real-valued fast Fourier transform algorithms, IEEE Transactions on Acoustics, Speech and Signal Processing, 35 (1987), pp. 849–863. The problem with split radix (due to Yavne (1968), not Duhamel (1984)) is that after a pass through the data, half the outputs are treated differently from the other half. This typically leads to data being stored back into memory rather than being further processed, and this increases the number of loads and stores.
Going to radix 4 or 8 lowers the number of loads and stores significantly. For a radix-2 step, you do 6 (real) additions, 4 multiplications, 6 loads, and 4 stores. With SSE2 these operations all get carried out in pairs. A radix-4 step requires 22 additions, 12 multiplications, 14 loads, and 8 stores. We can see that additions are reduced by 8%, multiplications by 25%, loads by 42%, and stores by 50% compared to the 4 radix-2 steps that a radix-4 step replaces.
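To make those counts concrete, here is a minimal scalar C sketch of the radix-2 DIT butterfly being counted (names and layout are illustrative only, not anyone's production code; with SSE2 each statement would operate on a pair of doubles):

Code:
/* Radix-2 DIT butterfly: 4 real muls, 6 real adds,
   6 real loads (a, b, w), 4 real stores (a, b). */
typedef struct { double re, im; } cplx;

static void butterfly2(cplx *a, cplx *b, cplx w)
{
    /* t = w * (*b): 4 muls, 2 adds */
    double tr = w.re * b->re - w.im * b->im;
    double ti = w.re * b->im + w.im * b->re;
    /* a' = a + t, b' = a - t: 4 adds */
    b->re = a->re - tr;  b->im = a->im - ti;
    a->re = a->re + tr;  a->im = a->im + ti;
}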
But I think too many people ignore the possibility of using real-half complex FFT algorithms, even though the inputs are more often than not real. It's just that the algorithms that use exclusively real arithmetic are more involved than the complex-arithmetic algorithms. But for efficiency you need the real-half complex algorithms: even if the full complex DFT is what is desired, getting there via a real-half complex DFT avoids the data-swizzling overhead that complex multiplications entail.
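As one concrete illustration of the real-half complex route, here is a hedged C99 sketch of the standard textbook unpacking step (my own illustration, not code from this thread): pack a length-2N real signal into a length-N complex array, run one complex FFT, then recover the real-input spectrum from conjugate symmetry.

Code:
#include <complex.h>
#include <math.h>

/* Input: Z[0..N-1] = FFT of z[n] = x[2n] + I*x[2n+1].
   Output: X[0..N], bins 0..N of the length-2N real FFT;
   the remaining bins follow from X[2N-k] = conj(X[k]). */
void unpack_real_fft(const double complex *Z, double complex *X, int N)
{
    const double pi = acos(-1.0);
    X[0] = creal(Z[0]) + cimag(Z[0]);   /* DC bin */
    X[N] = creal(Z[0]) - cimag(Z[0]);   /* Nyquist bin */
    for (int k = 1; k < N; k++) {
        double complex Ze = 0.5 * (Z[k] + conj(Z[N - k]));      /* FFT of even samples */
        double complex Zo = -0.5 * I * (Z[k] - conj(Z[N - k])); /* FFT of odd samples */
        X[k] = Ze + cexp(-I * pi * k / N) * Zo;                 /* twiddle and merge */
    }
}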
And I wish you hadn't talked CHClF2 into timing with cold instruction and data caches. Under those conditions it's hard to see the signal of a possible small improvement in code efficiency against the noise of loading instructions and data from memory. It also makes it hard to compare with published results. At least let him do 4 transforms starting with the same data set, printing out timing results for each transform. In this way you can see the effect of memory on the first transform, and the rest of us can see the effect of coding improvements in the last 3. This small number should be sufficient to warm up the caches and BTBs yet not cause overflow problems.
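In the spirit of that suggestion, a minimal harness might look like this (plain C; fft256() and the buffer sizes are placeholders I made up, not code from this thread). The first printed time carries the cold-cache cost; the last three show the code itself:

Code:
#include <stdio.h>
#include <string.h>
#include <time.h>

extern void fft256(double *re, double *im);  /* hypothetical routine under test */

int main(void)
{
    static double re0[256], im0[256], re[256], im[256];
    /* ... fill re0/im0 with the input signal ... */
    for (int run = 0; run < 4; run++) {
        memcpy(re, re0, sizeof re);   /* same data set every run */
        memcpy(im, im0, sizeof im);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fft256(re, im);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("run %d: %ld ns\n", run,
               (t1.tv_sec - t0.tv_sec) * 1000000000L
             + (t1.tv_nsec - t0.tv_nsec));
    }
    return 0;
}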
Post 07 Nov 2009, 07:49
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 17274
Location: In your JS exploiting you and your system
Xorpd! wrote:
And I wish you hadn't talked CHClF2 into timing with cold instruction and data caches. Under those conditions it's hard to see the signal of a possible small improvement in code efficiency against the noise of loading instructions and data from memory. It also makes it hard to compare with published results. At least let him do 4 transforms starting with the same data set, printing out timing results for each transform. In this way you can see the effect of memory on the first transform, and the rest of us can see the effect of coding improvements in the last 3. This small number should be sufficient to warm up the caches and BTBs yet not cause overflow problems.
Real-world tests are the most important. Synthetic tests are only useful if they model a significant portion of what a real-world situation will do. If in the real usage situation the data is always "cold" in RAM, then optimising the FFT algo is pointless and wasteful, since it only accounts for a minor portion of the overall performance measurement. Better to optimise the RAM-access portions. Like I mentioned above, if you ignore the TLB associativity and simply "hope for the best" then you will never get good performance no matter how fast the inner FFT is. In all my FFT programming I have found that the actual FFT is not a significant delay; it is always acquiring the data and getting it into the CPU that takes the most time. RAM is slow, and you have to take that into account, else you just end up spinning your wheels waiting for other things to happen before you can start computing the FFTs. Without data to work on, the isolated FFT runtime is meaningless.
Post 07 Nov 2009, 08:35
tom tobias
Joined: 09 Sep 2003
Posts: 1320
Location: usa
Xorpd! wrote:
For a second there I thought you were going to reference Sorenson, Jones, Heideman, & Burris, Real-valued fast Fourier transform algorithms, IEEE Transactions on Acoustics, Speech and Signal Processing 35 (1987), pp. 849-863.
Thanks. Back in '86 I was living in Houston, where Burrus was the famous professor of Electrical Engineering at Rice University. I read all of his papers, with various degrees of comprehension, and frequently, these days, confound one with the other. Sorry for the miscommunication. Sorensen and Burrus, with or without one or more accomplices, wrote an article, I no longer remember which, that spelled out pretty thoroughly exactly how one should implement the split-radix FFT, and that was what I used, essentially just copying their work, which had been directed towards a TI chip, I forget now which one, so all I did was recompose their code into 80386 assembly... Well, there may have been one or two other small details that differed from their approach....
Wink

Xorpd! wrote:
The problem with split radix (due to Yavne (1968), not Duhamel (1984)) ...
Outstanding. Thank you very much for opening my eyes. Wow. I learned something new. I must not yet be as old and decrepit as I imagine. Oh, well, ok, yeah, perhaps that doesn't follow...

Xorpd! wrote:
....this increases the number of loads and stores.
Hmmmm. I am not sure about that!!!
hahaha.
I think that premise may be open to investigation....Or, maybe I am wrong....Hmm. Makes me almost want to get out my old box of floppies, and dig out the ones where I did the time measurements, back in '92-3.

Xorpd! wrote:

For a radix-2 step, you do 6 (real) additions, 4 multiplications, 6 loads, and 4 stores. With SSE2 these operations all get carried out in pairs. A radix-4 step requires 22 additions, 12 multiplications, 14 loads, and 8 stores. We can see that additions are reduced by 8%, multiplications by 25%, loads by 42%, and stores by 50% compared to the 4 radix-2 steps that a radix-4 step replaces.
Umm, I am not disputing you, because you obviously know infinitely more about the FFT than I ever will; however, if you will send me a private message with your email address, we can discuss this topic further, if you wish. I do not believe, sitting here, having already forgotten my name, let alone the FFT, that everything you have written is exactly correct for the method I chose. Maybe I implemented the FFT incorrectly, or perhaps I simply have forgotten how I did it, or both.

Xorpd! wrote:

It's just that the algorithms that use exclusively real arithmetic are more complex than....
But, friend, if we are not talking about real-world applications, with real numeric data arriving at the computer, what is the point of describing MIT's FFTW (the so-called Fastest Fourier Transform in the West, referring here to the USA "western" genre of cinema, in which gunslingers each claimed to be faster than one another)?
Xorpd! wrote:

And I wish you hadn't talked CHClF2 into timing with cold instruction and data caches...
I believe you have overestimated my influence on anyone. I repeat myself (we call that perseveration, in mental health disorders!): if it isn't real data, from the real world, the execution speed of the FFT is meaningless, in my humble opinion.
Xorpd! wrote:

Under those conditions it's hard to see the signal of a possible small improvement in code efficiency against the noise of loading instructions and data from memory.
Nope. Sorry, I don't agree at all.
First of all, the "noise" associated with signalling the onset of computation, once all 4096 points have been loaded into memory, is equal to the time needed to handle the interrupt, and nothing more. In competent hands, not my own, obviously, that task is both trivial and essentially timeless compared with the huge amount of time needed to execute the FFT. But even if it takes one microsecond, so what? The time measurement does not begin until AFTER the interrupt has been received:
1. pump in the data;
2. all data gathered?
3. no: back to 1;
4. yes: send interrupt;
5. start timer;
6. compute FFT;
7. compute Power Spectrum;
8. measure time;
9. display result.

There is no "noise" associated with "loading data".

Xorpd! wrote:

Also it makes it hard to compare with published results.
I apologize if everyone else in the world, or at least, the MIT world, is doing it WRONG, but here on the FASM forum, we endeavor to do things correctly. Measurement of FFT execution times on data already in cache makes absolutely no sense, whatsoever. If that's how folks at MIT are doing it, then, sorry, they are doing it INCORRECTLY.

Xorpd! wrote:

This small number should be sufficient to warm up caches and BTBs yet not cause overflow problem.
Gosh, I sure must have done my FFT incorrectly. I haven't the foggiest recollection of doing anything at all to warm up any cache; in fact, I am not sure that the 386 I used for most of my work even had a cache. But if it did, I was unaware of it, and I certainly did not tune it, massage it, heat it up, or exploit it in any way, shape, or form.

revolution wrote:
In all my FFT programming I have found that the actual FFT is not a significant delay; it is always acquiring the data and getting it into the CPU that takes the most time.
Well, you are probably correct, but I had not been discussing the latency of getting the data into the machine, just the execution time of the FFT plus Power Spectrum. I ignored the time needed to acquire the signal because of how the sampling frequency interacts with that time measurement.
For example, with Magnetic Resonance Imaging, particularly fMRI, one is dealing with a much faster sampling frequency than for human voice. So I was focused on frame time, i.e. the percentage of time required to compute the FFT relative to the amount of time needed to acquire the data. In my case, the best time I ever measured, using the fastest CPU I ever worked with (a 486DX2 at 66 MHz), was an execution time of about 1 millisecond, representing about 4% of the frame time.

For that era, that execution time was Speedy Gonzales.
revolution wrote:

Without data to work on, the isolated FFT runtime is meaningless.
Absolutely, 100% agreement.
Post 07 Nov 2009, 21:54
r22
Joined: 27 Dec 2004
Posts: 805
A few points...

xorpd! - My forum handle has nothing to do with the ozone-killing byproduct of aerosol cans, although it is an amusing coincidence.
- Changing how the code is initialized, timed, and benchmarked is trivial, so I plan to have as many timing metrics as necessary.

revolution - If you are aware of an efficient means/ordering/algorithm to bring the FFT data to the processor cache, then by all means elaborate. Since your earlier posts alluded to per-chipset optimizations, I can only assume you don't consider there to be a sufficiently general solution.

Real-time or post-processed, cached or in RAM, if the algorithm uses a more efficient flow and opcodes, then it's a superior implementation, whether or not it runs an order of magnitude faster. I'd prefer a general-case solution, but I'd be satisfied if in the end it was the fastest on only a small subset of hardware.

x86-64 likes to work with 64-byte data blocks (see the Intel/AMD architecture optimization manuals), which makes the case for using radix-4 for a double-precision FFT.
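The arithmetic behind that: a double-precision complex value is 16 bytes, so the four inputs of a radix-4 quartet fill exactly one 64-byte cache line when stored contiguously and aligned. A tiny C11 sketch of the size check (my illustration, not thread code):

Code:
#include <assert.h>
#include <stdalign.h>

typedef struct { double re, im; } cplx;          /* 16 bytes */
typedef struct { alignas(64) cplx v[4]; } quad;  /* 4 x 16 = 64 bytes */

static_assert(sizeof(quad) == 64, "a radix-4 quartet fills one cache line");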

Arithmetic-wise, using a radix-16 FFT for 256 elements would seem optimal because you'd be able to easily unroll (2x) the outer loop (the recursive 'level'), and the number of adds and muls would be lower. However, the inner loop would be fairly starved for XMM registers.

This subject is holding my attention longer than most, so perhaps next week I'll get some more code up on this thread. I'm thinking a radix-4 FFT-256 (optimized to about the level of the current code) with a more robust benchmark.

Aside:
Some of the FFT implementations I've found online are in Fortran or converted to C from Fortran. I was forced to remember how embarrassingly horrid Fortran syntax is, especially the variable-naming conventions.
Post 07 Nov 2009, 23:39
revolution
When all else fails, read the source
Joined: 24 Aug 2004
Posts: 17274
Location: In your JS exploiting you and your system
r22 wrote:
revolution - If you are aware of an efficient means/ordering/algorithm to bring the FFT data to the processor cache, then by all means elaborate. Since your earlier posts alluded to per-chipset optimizations, I can only assume you don't consider there to be a sufficiently general solution.
Well, I was thinking of the NUMA/SMP thing there. But this all comes back to how you handle the data while it is in RAM. Since RAM is slow, dealing with it will be a significant portion of the execution time. You can read the TLB metrics from the CPU and order your data so that successive FFT passes do not force thrashing (and eviction) of the cache tags. During the inner loops this is not an issue, but it can cause unnecessary delays in the outer loops, which access data in multiples of the cache associativity size. Each time your data stream reaches the size of the cache block, insert a gap equal in size to one cache line.
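One way to read that last suggestion, as a sketch under sizes I am assuming here (not revolution's actual code): give the array one cache line of slack every time the logical stream crosses a block of the way size, so the power-of-two strides of the outer FFT passes stop mapping to the same cache set.

Code:
#include <stdlib.h>

#define WAY_BYTES  4096u  /* assumed: cache size / associativity */
#define LINE_BYTES   64u  /* assumed cache-line size */

enum { WAY_DOUBLES  = WAY_BYTES  / sizeof(double),   /* 512 */
       LINE_DOUBLES = LINE_BYTES / sizeof(double) }; /*   8 */

/* Map a logical element index into the padded layout: one line of
   slack is skipped after every WAY_DOUBLES elements. */
static size_t padded_index(size_t i)
{
    return i + (i / WAY_DOUBLES) * LINE_DOUBLES;
}

/* Allocate room for n logical doubles plus the padding gaps. */
double *alloc_padded(size_t n)
{
    return malloc((n + (n / WAY_DOUBLES + 1) * LINE_DOUBLES)
                  * sizeof(double));
}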
Post 08 Nov 2009, 00:09
Xorpd!
Joined: 21 Dec 2006
Posts: 161
tom tobias wrote:
Back in '86 I was living in Houston, where Burrus was the famous professor of Electrical Engineering at Rice University. I read all of his papers

He wrote about 700 papers, IIRC. Quite a lot of reading.
tom tobias wrote:
Xorpd! wrote:

For a radix-2 step, you do 6 (real) additions, 4 multiplications, 6 loads, and 4 stores. With SSE2 these operations all get carried out in pairs. A radix-4 step requires 22 additions, 12 multiplications, 14 loads, and 8 stores. We can see that additions are reduced by 8%, multiplications by 25%, loads by 42%, and stores by 50% compared to the 4 radix-2 steps that a radix-4 step replaces.
Umm, I am not disputing you, because you obviously know infinitely more about the FFT than I ever will; however, if you will send me a private message with your email address, we can discuss this topic further, if you wish.

No PM, sorry. Algorithmic choices are more interesting now than in 486 days because now you have to deal with pipelining and SIMD.
tom tobias wrote:
But, friend, if we are not talking about real world applications, with real numeric data arriving at the computer,

The real world is a different reality for all participants. I have seen an algorithm where an FFT was just a step in an iterative process. Caches can stay warm in that context. Even if you are going to start with cold caches I would try to benchmark separate aspects of the problem just as I would modularize code development and testing. Difficult problems admit the possibility of making progress in several steps, and it seems useful to me to check whether any step taken is a step forward or a step back.
Post 08 Nov 2009, 05:14
tom tobias
Joined: 09 Sep 2003
Posts: 1320
Location: usa
Xorpd! wrote:
The real world is a different reality for all participants.
Well, in terms of reality, I am thinking of the sunshine in that Egyptian desert, heating those brass cannons, with the result that Napoleon's invading French army's cannon fire was misdirected and imprecise. Fourier's assignment, traveling with the military, led to the development of his famous transform, which we now employ in everything from medical scanning devices for the detection of injury and disease, to audio signal processing, and many, many other real-world applications... That's not to write that the FFT could not also be a subject for mathematics itself, in which environment a simulation exercise may be extraordinarily useful.

FASM too can be run on an emulator, but for most of us its strength lies in its real-world applicability. Real world does not mean "virtual reality" or "simulation"; it means concrete data, measurable and quantifiable, like the heat in the Egyptian desert. Using the FFT to solve genuine problems, not simulation exercises, harkens back to that day in the 19th century when Fourier devised his elegant transform.

Xorpd! wrote:
Algorithmic choices are more interesting now than in 486 days because now you have to deal with pipelining and SIMD.
Well, I confess that when the 486 first arrived, I wasted a year trying to split the problem into two computations, half completed by the integrated math coprocessor and half computed by the integer unit. So, yes, I am guilty of spending a lot of time trying to optimize the end result (a speedier execution) based upon architectural advances in the CPU, rather than thinking about optimizing the code itself....
(that's a hint, r22!)

Xorpd! wrote:
Difficult problems admit the possibility of making progress in several steps, and it seems useful to me to check whether any step taken is a step forward or a step back.
Seems we agree on something! Then it becomes only a question of whether one is marching forward or sideways by considering cache, whether cold or hot, since any FFT of genuine utility will be obliged to operate on data sitting not in cache but in memory, until such time as one can pipe data from the outside world directly into cache, bypassing RAM.

Smile
Post 08 Nov 2009, 11:19
r22
Joined: 27 Dec 2004
Posts: 805
The radix-4 FFT finally works in Java. Kind of funny: I couldn't figure out why it wouldn't work... the algorithms for the butterfly and the reordering were correct... then I realized that for the implementation I was using, I had to do the reordering LAST, not FIRST.

I have two radix-4 algorithms that I'm looking at...
1 - Uses a different (not bit-reverse) reordering after the FFT calculations
2 - Uses the bit-reverse ordering first (see the sketch below) and seems to use different omega values (the precomputed values for the complex multiply)
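For reference, the radix-4 counterpart of bit reversal is base-4 digit reversal (reversing the index two bits at a time). A small illustrative sketch in C, not r22's Java code:

Code:
#include <stdio.h>

/* Reverse the base-4 digits of i for a radix-4 FFT of length N = 4^m.
   Applied before or after the butterflies, depending on the DIT/DIF
   variant in use. */
static unsigned reverse_base4(unsigned i, unsigned m)
{
    unsigned r = 0;
    for (unsigned d = 0; d < m; d++) {
        r = (r << 2) | (i & 3u);  /* move the lowest digit to the top */
        i >>= 2;
    }
    return r;
}

int main(void)
{
    /* N = 256 (m = 4): index 27 = 0123 in base 4 maps to 3210 = 228. */
    printf("%u\n", reverse_base4(27, 4));
    return 0;
}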
Post 12 Nov 2009, 04:11