flat assembler
Message board for the users of flat assembler.
Index
> Heap > Algorithm: FFT (Fast Fourier Transform) Goto page Previous 1, 2, 3 
Author 

revolution
Xorpd! wrote: And I wish you hadn't talked CHClF2 into timing with cold instruction and data caches. Under those conditions it's hard to see the signal of a possible small improvement in code efficiency against the noise of loading instructions and data from memory. Also it makes it hard to compare with published results. At least let him do 4 transforms starting with the same data set, printing out timing results for each transform. In this way you can see the effect of memory on the first transform and the rest of us can see the effect of coding improvements in the last 3. This small number should be sufficient to warm up caches and BTBs yet not cause overflow problem. 

07 Nov 2009, 08:35 

tom tobias
Xorpd! wrote: For a second there I thought you were going to reference Sorenson, Jones, Heideman, & Burris, Realvalued fast Fourier transform algorithms, IEEE Transactions on Acoustics, Speech and Signal Processing 35 (1987), pp. 849863. Xorpd! wrote: The problem with split radix (due to Yavne (1968), not Duhamel (1984)) ... Xorpd! wrote: ....this increases the number of loads and stores. hahaha. I think that premise may be open to investigation....Or, maybe I am wrong....Hmm. Makes me almost want to get out my old box of floppies, and dig out the ones where I did the time measurements, back in '923. Xorpd! wrote:
Xorpd! wrote:
Xorpd! wrote:
Xorpd! wrote:
First of all, the "noise" associated with signalling the onset of computation, once all 4096 points have been loaded into memory, is equal to the time needed to handle the interrupt, and nothing more. In competent hands, not my own, obviously, that task is both trivial, and essentially timeless, compared with the huge amount of time needed to execute the FFT. But, even if it takes one microsecond, so what? The time measurement does not begin until AFTER the interrupt has been received. 1. pump in the data; 2. all data gathered? 3. no: back to 1; 4. yes: send interrupt; 5. start timer 6. compute FFT 7. compute Power Spectrum 8. measure time 9. display result. There is no "noise" associated with "loading data". Xorpd! wrote:
Xorpd! wrote:
revolution wrote: In all my FFT programming I have found that the actual FFT is not a significant delay, it is always acquiring the data and getting it into the CPU that takes the most time. For example, with Magnetic Resonance Imaging, particularly fMRI, one is dealing with a much faster sampling frequency than for human voice. So, I was focused on frame time, i.e. the percentage of time required to compute the FFT, relative to the amount of time needed to acquire the data. In my case, best time ever measured, using the fastest cpu I ever worked with, the 486/DX2: 66MHz, the execution time was about 1 millisecond, representing about 4% of the frame time. For that era, that execution time was speedy gonzalez. revolution wrote:


07 Nov 2009, 21:54 

r22
A few points...
xorpd!  My forum handle has nothing to do with the ozone killing byproduct of aerosol cans, although it is an amusing coincidence.  Changing how the code is initialized, timed and benchmarked is trivial, so I plan to have as many timing metrics as necessary. revolution  If you are aware of an efficient means/ordering/algorithm to bring the FFT data to the processor cache then by all means elaborate. Since your earlier posts alluded to per chipset optimizations I can only assume you don't consider there to be a proper enough generalized solution. Realtime or post processed, cached or in RAM, if the algorithm uses a more efficient flow and opcodes then it's a superior implementation despite whether it runs an order of magnitude faster or not. I'd prefer a general case solution, but I'd be satisfied if in the end it was the fastest on only a small subset of hardware. x8664 likes to work with 64byte data blocks (Intel/AMD arch optimization manuals), which makes the case for using radix4 for double precision FFT. Arithmetic wise using a radix16 FFT for 256 elements would seem optimal because you'd be able to easily unroll (2x) the outer loop (the recursive 'level'), also the number of adds and muls would be lower. However, the inner loop would be fairly XMMx register starved. This subject is holding my attention longer than most, so perhaps next week I'll get some more code up on this thread. I'm thinking radix4 FFT 256 (optimized to about the level that the current code is) with a more robust benchmark. Aside: Some of the FFT implementations I've found online are in Fortran or converted to C from Fortran. I was forced to remember how embarrassingly horrid Fortran syntax is, especially the variable naming convention. 

07 Nov 2009, 23:39 

revolution
r22 wrote: revolution  If you are aware of an efficient means/ordering/algorithm to bring the FFT data to the processor cache then by all means elaborate. Since your earlier posts alluded to per chipset optimizations I can only assume you don't consider there to be a proper enough generalized solution. 

08 Nov 2009, 00:09 

Xorpd!
tom tobias wrote: Back in '86 I was living in Houston, where Burrus was the famous professor of Electrical Engineering at Rice University, I read all of his papers He wrote about 700 papers, IIRC. Quite a lot of reading. tom tobias wrote:
No PM, sorry. Algorithmic choices are more interesting now than in 486 days because now you have to deal with pipelining and SIMD. tom tobias wrote: But, friend, if we are not talking about real world applications, with real numeric data arriving at the computer, The real world is a different reality for all participants. I have seen an algorithm where an FFT was just a step in an iterative process. Caches can stay warm in that context. Even if you are going to start with cold caches I would try to benchmark separate aspects of the problem just as I would modularize code development and testing. Difficult problems admit the possibility of making progress in several steps, and it seems useful to me to check whether any step taken is a step forward or a step back. 

08 Nov 2009, 05:14 

tom tobias
Xorpd! wrote: The real world is a different reality for all participants. FASM too can be run on a emulator, but, for most of us, its strength lies in its real world applicability. Real world does not mean "virtual reality", or "simulation", it means concrete data, measurable, and quantifiable, like the heat in the Egyptian desert. Using the FFT to solve genuine problems, not simulation exercises, harkens back to that day in the 19th century when Fourier devised his elegant transform. Xorpd! wrote: Algorithmic choices are more interesting now than in 486 days because now you have to deal with pipelining and SIMD. (that's a hint, r22!) Xorpd! wrote: Difficult problems admit the possibility of making progress in several steps, and it seems useful to me to check whether any step taken is a step forward or a step back. 

08 Nov 2009, 11:19 

r22
Radix4 FFT finally works in Java. Kind of funny, I couldn't figure out why it wouldn't work... the algorithms for the butterfly and the reordering were correct... then I realized that for implementation I was using I had to do the reorder LAST not FIRST.
I have two radix4 algorithms that I'm looking at... 1  Uses a different (not bit reverse) reordering after the FFT calculations 2  Uses the bit reverse ordering first and seems to use different omega values (the precomputed values for the complex multiply) 

12 Nov 2009, 04:11 

Goto page Previous 1, 2, 3 < Last Thread  Next Thread > 
Forum Rules:

Copyright © 19992020, Tomasz Grysztar.
Powered by rwasa.