flat assembler
Message board for the users of flat assembler.

Algorithm: FFT (Fast Fourier Transform)

Borsuc

Joined: 29 Dec 2005
Posts: 2466
Location: Bucharest, Romania
@tom: "single precision" means 32-bit float, so 32-bit integers would be 0.8 microseconds or less, not 1.4 as you said.

1.4 is for "double precision" aka 64-bit.

_________________
Previously known as The_Grey_Beast
Post 28 Oct 2009, 15:25
r22

Joined: 27 Dec 2004
Posts: 805
Once I get the ASM code done (I'm still in the Java prototyping/algorithm stage), this thread can be moved to MAIN (or I'll just start a new one) and we can benchmark my code against those math libraries.

I plan on owning the fastest implementation of FFT 1D double precision on the x86-64 platform..
Post 28 Oct 2009, 16:33
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
r22 wrote:
I plan on owning the fastest implementation of FFT 1D double precision on the x86-64 platform..
A laudable goal, but which CPU/memory/chipset/OS combos will you be targeting? Do you plan to implement dynamic code path selection based upon CPU/cache/memory/FFT metrics? What FFT size range are you going to optimise for? Hmm, what else ... Oh yeah, don't forget about the TLB cache set associativity, many people seem to forget about that and end up with crap performance because of TLB collisions causing extra page table walks.

And the biggest Q of all, how will you judge "fastest"? Not as easy to judge as it may seem. You need to precisely state your measuring criteria, or else you will never know whether you really have the "fastest" or not.
Post 29 Oct 2009, 11:28
r22

Joined: 27 Dec 2004
Posts: 805
Quote:
A laudable goal, but which CPU/memory/chipset/OS combos will you be targeting?

As stated, an x86-64 CPU. Memory should be enough to fit the benchmark code and test arrays. Chipset should be irrelevant. OS: if the FFT code is in-place it would run on any x86-64 compatible OS; if not, memory allocation will need to be tuned per OS.
Quote:
Do you plan to implement dynamic code path selection based upon CPU/cache/memory/FFT metrics?

Nope, possibly different code paths based on FFT parameters in later iterations.
Quote:
What FFT size range are you going to optimise for?

Haven't decided; maybe relatively small, like 256 to 4K.
Quote:
Hmm, what else ... Oh yeah, don't forget about the TLB cache set associativity, many people seem to forget about that and end up with crap performance because of TLB collisions causing extra page table walks.

Mostly irrelevant if I choose a small data set size. 4096 entries * 16 bytes (2 doubles) is only 64K. Aligned 16-byte data elements (one complex double: real and imaginary) won't get straddled across cache lines, and even low-end x86-64 processors have enough cache and TLB slots for this size. If I choose an algorithm like Stockham autosort for the FFT, which uses a temp array, I'll need to consider MOVNTDQ and PREFETCHNTA.
Post 29 Oct 2009, 16:25
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
x86-64 CPU is ambiguous: Intel, AMD? Single core? Dual? Quad? NUMA? SMP? They all have different methods for optimisation.

Chipset and memory path is not irrelevant, you have to get the data from RAM into cache somehow. This will affect your test result timings (which, BTW, you have not yet defined).

Without different code paths for CPU/cache/memory you won't be able to make the "fastest" FFT implementation; it would perhaps be optimised only for your own system and not for the other cases.

Your raw data set size, 4k entries x 16 bytes, is 64kB (plus program variables and temps and whatnot). That won't all fit into L1 data cache on most CPUs. You will have to consider the cache subsystem in your code if you want "fast" code. But ignoring the TLB associativity in FFT algos is going to hurt you terribly. The FFT's access patterns map really badly onto the TLB on most CPUs (unless they have full associativity). You might find that staggering the data arrays gives a significant performance boost. Just saying, because I already faced this problem some time ago.
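
To put rough numbers on that (illustrative only; the exact figures vary by CPU): a set-associative data TLB indexes its sets by the low bits of the virtual page number, so with S sets, addresses that differ by a multiple of S x 4KiB all compete for the same set. The power-of-two strides of the FFT's later passes hit exactly such addresses over and over, so a handful of pages can keep evicting each other even though the TLB has plenty of entries overall; padding or staggering the arrays by an odd number of pages breaks the pattern.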
Post 29 Oct 2009, 18:13
r22

Joined: 27 Dec 2004
Posts: 805
revolution wrote:
x86-64 CPU is ambiguous: Intel, AMD? Single core? Dual? Quad? NUMA? SMP? They all have different methods for optimisation.

Chipset and memory path is not irrelevant, you have to get the data from RAM into cache somehow. This will affect your test result timings (which, BTW, you have not yet defined).

Without different code paths for CPU/cache/memory you won't be able to make the "fastest" FFT implementation; it would perhaps be optimised only for your own system and not for the other cases.

Your raw data set size, 4k entries x 16 bytes, is 64kB (plus program variables and temps and whatnot). That won't all fit into L1 data cache on most CPUs. You will have to consider the cache subsystem in your code if you want "fast" code. But ignoring the TLB associativity in FFT algos is going to hurt you terribly. The FFT's access patterns map really badly onto the TLB on most CPUs (unless they have full associativity). You might find that staggering the data arrays gives a significant performance boost. Just saying, because I already faced this problem some time ago.


Before ANY of this is relevant I need to pick an algorithm.

As for chipset level optimizations, STOP TROLLING :D, citations needed.

Threading/core count will be based on the data set size range; multi-threading with a small data set would likely be slower. So for now the algorithm is tentatively SINGLE THREADED.

I'm hoping my implementation will be the fastest possible IN COMPARISON to the current math library implementations. NOT the literal "fastest possible".

And finally benchmarking.
Run-time initialized data array, time in, many iterations of FFT, time out.
Post 29 Oct 2009, 18:45
rugxulo

Joined: 09 Aug 2005
Posts: 2341
Location: Usono (aka, USA)
r22 wrote:

As for chipset level optimizations, STOP TROLLING Very Happy, citations needed.


(Don't you hate the way that gets bandied about so much these days? Ah well, I see the smiley now, but still ... already found a few pics to "lighten" the mood, heh):

Image

Image


Laughing
Post 29 Oct 2009, 20:44
r22

Joined: 27 Dec 2004
Posts: 805
UPDATE
Just tried out a Java implementation of Stockham autosort, and timing-wise it seems to be ~50% slower than the reorder + Cooley-Tukey method.

I had hopes for the Stockham method, thinking the temp array would outdo the reorder, but no dice. Apparently my view of temp arrays was skewed by my previous work with RadixSort (beating QuickSort) <link to fasm thread no one will click on and I'm too lazy to search for>.

@rugxulo, STOP SPAMMING Very Happy, filling my thread with silliness
PUSH [Serious_Business]
Post 29 Oct 2009, 23:07
revolution
When all else fails, read the source

Joined: 24 Aug 2004
Posts: 17278
Location: In your JS exploiting you and your system
r22 wrote:
I'm hoping my implementation will be the fastest possible IN COMPARISON to the current math library implementations. NOT the literal "fastest possible".
Yes, of course. But you still need to define how you will compare. Under what circumstances? Which CPU? What test data set? Otherwise any comparisons will not mean anything.

It seems that your parameters for benchmarking are very limited. Have you set these parameters based upon an expectation that those are the most commonly used data sets? It would be a shame to be optimising for something that is not used often.
Post 30 Oct 2009, 02:17
tom tobias

Joined: 09 Sep 2003
Posts: 1320
Location: usa
r22 wrote:

I plan on owning the fastest implementation of FFT 1D double precision on the x86-64 platform..
A laudable goal... I had the same goal in 1981, actually started on it two years later, and published my results ten years after that. At that time I was the first with a color (1280 x 1024) display, previously all monochrome, of real-time spectral analysis of human phonemes on an ordinary desktop computer, which in those days meant a 486/DX2-66 with 16 megabytes of RAM, costing $3K. In the decades preceding, the same capability, with only black-and-white imaging, required special-purpose computers costing $10K minimum.
revolution wrote:

And the biggest Q of all, how will you judge "fastest"?
Measuring the time to execute the FFT is not a trivial task. In my case, since I was displaying the power spectrum, I had an additional computation to complete after the FFT itself. The biggest challenge, of course, is to find a method of time measurement which does not interfere with execution of the FFT.
r22 wrote:
Before ANY of this is relevant I need to pick an algorithm.
Absolutely agree.
r22 wrote:

So for now the algorithm is tentatively SINGLE THREADED.
That's good. Then, perhaps, one could envision a dual-core CPU with one core measuring the time while the other executes the FFT.
r22 wrote:

I'm hoping my implementation will be the fastest possible IN COMPARISON to the current math library implementations. NOT the literal "fastest possible".
Yeah, I had a different plan back then. Not sure if mine really was the fastest, though I always claimed it was, simply because it was so much faster, by an order of magnitude, than all of the other extant, published FFTs of that era.
r22 wrote:

And finally benchmarking.
Run-time initialized data array, time in, many iterations of FFT, time out.
Yeah, here we diverge radically.
I already mentioned, in a previous post, my "real world" application, meaning that the testing had to be performed on the code implementing the actual paradigm, rather than something artificial, as you suggest here.
In other words, you are setting up a testing method which guarantees that the data will be sitting in cache, so naturally the execution will be faster than running your own code on data sitting in RAM instead of cache. The data in the real world enters the computer's RAM, not the CPU's cache. Try to time a single iteration, instead of repeating multiple iterations and then dividing the measured time by the number of iterations. I think you will discover that it is much more challenging to measure the time for a single trial!!!

The data must be random, not preassigned. Further, the data ought to be sent in, an interrupt generated once the 4096 points (in my case, 256 points) have been sampled, and the time test begun upon arrival of that interrupt.

Smile
Post 30 Oct 2009, 11:43
r22

Joined: 27 Dec 2004
Posts: 805
UPDATE
I've started the ASM implementation.

I'm going to use the data section to hold the pre-calculated omega values and pre-calculated reverse bit order indexes. I'm thinking the final version would have the pre-computed values for 256, 512, 1024, 2048 and 4096 elements since that is the range I'm optimizing for.
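
Whether those tables end up hard-coded in the data section or generated once at start-up, the bit-reversed index table amounts to something like this minimal sketch (FFT256_REV is a placeholder name, not taken from the actual file):
Code:
; hypothetical one-time generator of the 256-entry bit-reversed index table
; the table itself would be e.g. FFT256_REV rw 256 in the data section
        xor     ecx, ecx                ; i = 0
.index:
        mov     eax, ecx
        mov     edx, 8                  ; 256 = 2^8, so reverse 8 bits
        xor     ebx, ebx
.bit:
        shl     ebx, 1                  ; make room for the next bit
        shr     eax, 1                  ; lowest remaining bit of i -> CF
        adc     ebx, 0                  ; append it to the reversed value
        sub     edx, 1
        jnz     .bit
        mov     [FFT256_REV + rcx*2], bx
        add     ecx, 1
        cmp     ecx, 256
        jb      .index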

Also, the FFT algorithm itself will be for 256 elements. This is just so I can easily test the output to make sure it's accurate; then I'll make it generic for power-of-2 length arrays and begin the tuning.

I may have code to post this week, depending on rl obligations. If I do it won't be optimized yet.

BENCHMARK @tom tobias
Testing-wise I've decided on single passes with run-time initialized data, following these steps (a rough sketch of the timing part appears after the list):
1->Real-time priority Thread and Process APIs
2->VirtualAlloc: create the data array
3->Fill data array
4->RDTSC: start time
5->Call FFT
6->RDTSC end time
7->Repeat from 2
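
For what it's worth, here is a minimal sketch of what steps 4 through 6 could look like in fasm; fft256, data_ptr, start_tsc and cycles are placeholder names, not taken from the attached file:
Code:
; hypothetical timing wrapper for steps 4-6 (all names here are placeholders)
; a serializing instruction (e.g. cpuid) before each rdtsc would tighten the measurement
        rdtsc                       ; step 4: EDX:EAX = start timestamp
        shl     rdx, 32
        or      rax, rdx
        mov     [start_tsc], rax
        mov     rcx, [data_ptr]     ; pointer to the run-time initialized data array
        call    fft256              ; step 5: the routine under test
        rdtsc                       ; step 6: EDX:EAX = end timestamp
        shl     rdx, 32
        or      rax, rdx
        sub     rax, [start_tsc]
        mov     [cycles], rax       ; elapsed TSC ticks for one FFT call

start_tsc dq ?
data_ptr  dq ?
cycles    dq ?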

But benchmarking is still irrelevant at this point.
Post 03 Nov 2009, 13:52
tom tobias

Joined: 09 Sep 2003
Posts: 1320
Location: usa
r22 wrote:
...I may have code to post this week,...
That is incredible.
I did not realize how truly ignorant I am until watching some genuinely skillful programmers churn out in a day programs that took me two months just to figure out what I had hoped to accomplish, and a year and a half to implement.

r22 wrote:
...But benchmarking is still irrelevant at this point.
Yes, perhaps. However, your algorithm will affect the timing measurements, so while the implementation details of the measurement are irrelevant at the moment, as you have indicated, the method of executing the FFT for testing will have an impact on your design. For example:

r22 wrote:

2->VirtualAlloc: create the data array
As I wrote last time, I don't think that this step will result in a meaningful measurement of the FFT execution time, once that task is set in motion, because your data will already be in cache:
Quote:
3->Fill data array

I agree with you, of course, that right now, your top priority is writing the FFT itself. To test my FFT, I used a battery powered frequency generator. I suppose there are far more sophisticated methods available today!!!
Smile
Post 04 Nov 2009, 09:53
r22

Joined: 27 Dec 2004
Posts: 805
I considered the caching issue, but didn't post about it.

The VirtualAlloc will create a block of memory much larger than the processor's cache (~50MB).

It will then initialize this memory from beginning to end; the initialization loop will have an SSE PREFETCH opcode in it hinting the L2 cache. I'm hoping that only the end of the memory block will still be held in cache by the end of the initialization.

Since only the beginning of the memory block will be sent to the FFT function, the data will hopefully be in RAM rather than in cache.
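
A minimal sketch of that fill loop, with placeholder names (big_block, BLOCK_SIZE) rather than anything from the actual benchmark file:
Code:
; hypothetical init loop over the ~50MB block; big_block / BLOCK_SIZE are placeholders
BLOCK_SIZE = 50 shl 20                  ; ~50MB, much larger than any cache level
        mov     rdi, [big_block]        ; base address returned by VirtualAlloc
        mov     rcx, BLOCK_SIZE shr 4   ; number of 16-byte stores
        pxor    xmm0, xmm0              ; fill pattern (zeros, for the sketch)
.fill:
        prefetcht1 [rdi + 512]          ; the L2 hint described above
        movdqa  [rdi], xmm0             ; movntdq here would bypass the cache entirely
        add     rdi, 16
        sub     rcx, 1
        jnz     .fill

big_block dq ?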

*addendum* as for coding speed, I already have the algorithm in Java and I already have my musings about the SIMD implementation of the butterfly (main calculation) portion of the FFT. So I just need the time and willpower (HLL -> ASM is tedious) to write it all down in ASM.
Post 04 Nov 2009, 16:55
r22

Joined: 27 Dec 2004
Posts: 805
Here is FFT256 with a single run benchmark designed for worst time.

Haven't checked it for correctness, but just the thought of me being wrong is laughable :D :D :D

Dividing the Cycle count from the RDTSC timing by my processor's frequency comes to a speed around 0.00004 sec (1 execution FFT 256 elements).

Next steps:
- Actually check the output for correctness
- Optimize
- - the SSE code in FFT256 is bad; currently it's using the code I wrote in my browser window and posted earlier in this thread (there was one compile error: MOVHLPD apparently doesn't exist, so I had to use MOVHLPS)
- - the register usage is all over the place and should be fine tuned
- - the loops are straight ports from Java; I should modify the counters to act as addressing registers, as well as unroll to take advantage of the extra XMM registers
- Add math library implementations to the benchmark for comparison
- Make 512, 1024, 2048, 4096 element, and a generic N element version of the FFT for completeness IF I'm satisfied with the performance.

Alas, tomorrow I have to rework 4000 lines of 10 year old ASP + IE5 JavaScript + inline CSS to a properly styled cross browser compatible markup. As a result, I'll likely be too cranky to program for a few days.

*EDIT* 11/5 new code file


Description: Unoptimized 256 element FFT + benchmark (x86-64 Win)
Filename: FFT Benchmark.ASM
Filesize: 16.62 KB
Downloaded: 91 Time(s)



Last edited by r22 on 05 Nov 2009, 23:56; edited 1 time in total
Post 05 Nov 2009, 03:05
r22

Joined: 27 Dec 2004
Posts: 805
Quote:
Alas, tomorrow I have to rework 4000 lines of 10 year old ASP + IE5 JavaScript + inline CSS to a properly styled cross browser compatible markup. As a result, I'll likely be too cranky to program for a few days.

~2300 lines to go !

*GASP* I'm spamming my own thread :/
Post 05 Nov 2009, 16:48
Xorpd!

Joined: 21 Dec 2006
Posts: 161
I can't really comment until you get the testing infrastructure built. Since things are going to get really complicated in a hurry if you want a fast FFT, you are going to need a test for correctness that you can rely on. I couldn't compile your program, but maybe I just have an old version of FASM.
In your inner loop, I wish to point out that PSHUFD takes 2 μops but MOVDDUP, UNPCKHPD/UNPCKLPD, and SHUFPD take only one on a Core 2 Duo. Part of the fun of programming SIMD is trying to make sure the data swizzling operations don't eat up all the time. This is more difficult with single precision, of course. There are algorithms for the FFT that avoid this problem, but we'll get to those on another day.
Also, using PADDQ to flip the sign bit is rather obscure. Clearer would be XORPD, but you should know that there is a useful instruction called ADDSUBPD.
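
To make that concrete, a minimal sketch of the two alternatives side by side (sign_lo is a placeholder label, not something from r22's file):
Code:
; option 1: flip the sign of the low double explicitly, then do a plain packed add
        xorpd   xmm2, dqword [sign_lo]
        addpd   xmm0, xmm2
; option 2: ADDSUBPD does the same in one instruction
;           (the low element is subtracted, the high element is added)
;       addsubpd xmm0, xmm2

align 16
sign_lo dq 8000000000000000h, 0         ; sign bit of the low qword only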
Post 05 Nov 2009, 17:56
r22

Joined: 27 Dec 2004
Posts: 805
UPDATE

- Verified its correctness (ran a trusted FFT algo with the same input in Java and of course the output matched up)

- Used some of Xorpd!'s suggestions and tuned the INNER LOOP
Code:
        movddup xmm14, qword[FFT256_WK + r15] ;; WK[LEVEL][J]
        movddup xmm15, qword[FFT256_WK + r15 + 8]
;;for (int i=j; i<SAMPLES; i=i+(step << 1))
        mov     edx, ebx ;; I = J
.I:
        lea     r14, [rdx + rax] ;;STEP + I
        mov     r13, rdx ;; I
        shl     r14, 4 ;; * 16
        shl     r13, 4 ;; * 16
        ;;(first pass) FFT calculation
        MOVDQA  XMM4, dqword[rsi + r13] ;;XMM4 = I[A] | R[A]
        MOVDQA  XMM1, dqword[rsi + r14] ;;XMM1 = I[B] | R[B]
        MOVDQA  XMM0, XMM14 ;;XMM0 = P[C] | P[C]
        MOVDQA  XMM2, XMM15 ;;XMM2 = Q[C] | Q[C]
        PSHUFD  XMM3, XMM1, 01001110b ;;XMM3 = R[B] | I[B]
        MOVDQA  XMM5, XMM4 ;;XMM5 = I[A] | R[A]
        MULPD   XMM0, XMM1 ;;XMM0 = P[C] * I[B] | P[C] * R[B]
        MULPD   XMM2, XMM3 ;;XMM2 = Q[C] * R[B] | Q[C] * I[B]
        ;;ADDSUBPD subtracts the low double and adds the high one,
        ;;so it also takes care of flipping the sign bit of the low QWORD/DOUBLE of XMM2
        ADDSUBPD XMM0, XMM2 ;;XMM0 = P[C] * I[B] + Q[C] * R[B] | P[C] * R[B] - Q[C] * I[B] ;;XMM0 = tmpi | tmpr
        add     edx, r12d
        ADDPD   XMM4, XMM0 ;;Complex ADD XMM4 = I[A] + tmpi | R[A] + tmpr
        SUBPD   XMM5, XMM0 ;;Complex SUB XMM5 = I[A] - tmpi | R[A] - tmpr
        MOVDQA  dqword[rsi + r13], XMM4 ;;store results back
        MOVDQA  dqword[rsi + r14], XMM5
        cmp     edx, __SAMPLES
        jb      .I
;;NEXT I 
    


I'm now getting timings around 0.000010 sec (10 microseconds) for a single execution of a 256-element FFT.

I'll update the previous post with the new code file.
I am using FASM 1.69.10 to compile.

I think optimizing the loops, indexing and addressing are next up.

Suggestions and comments are appreciated.
Post 05 Nov 2009, 23:53
Madis731

Joined: 25 Sep 2003
Posts: 2141
Location: Estonia
Actually, on an ASM board the more relevant measurement of time would be clocks rather than µs Smile If it takes 10µs on a PMMX@200MHz, then it's impressive, but 10µs on a 3.2GHz i7 isn't that much to brag about. At least you should always give the clock speed and architecture if you give seconds. Thanks!

I think you know where I'm going with this. I think we should compare implementations based on clocks/complex multiply or something.
Post 06 Nov 2009, 08:45
r22

Joined: 27 Dec 2004
Posts: 805
Good point. The benchmark does output the RDTSC time.

1.8GHz Core2 6300 w/ 2 GB DDR2 800 RAM
My timings are between 4000h and 4300h cycles (roughly 16,400 to 17,200)

After I'm satisfied with the optimization (I may end up using radix-4 because it has fewer complex adds and muls) I'll enhance the benchmark to give a more accurate cycle count as well as speed.
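
As a rough conversion into the units Madis731 suggested (assuming the usual radix-2 count of (N/2)*log2(N) butterflies, i.e. 1024 for N = 256): ~16,500 cycles over 1024 butterflies works out to roughly 16 cycles per complex multiply.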
Post 06 Nov 2009, 16:39
tom tobias

Joined: 09 Sep 2003
Posts: 1320
Location: usa
r22 wrote:
...(I may end up using radix-4 because it has fewer complex adds and muls)...
Umm, I think you will likely end up with Duhamel's split radix, because you plan to write this FFT for half a dozen scenarios, each of which computes the FFT for a quantity of points that is a multiple of 4: 256 --> 4096. Yes, as you noted, radix-4 has fewer complex additions and multiplications than radix-2. I believe, without really understanding the mathematics, that the combination of radix-2 and radix-4 is more efficient in implementing the Cooley-Tukey algorithm, for that range of data points, than either radix-2 or radix-4 alone:
H. V. Sorensen, M. T. Heideman, and C. S. Burrus, "On computing the split-radix FFT," IEEE Transactions on Acoustics, Speech and Signal Processing, 34 (1986), pp. 152–156.
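
(For rough context, the commonly quoted operation counts: radix-2 needs about 5N*log2(N) real operations, radix-4 about 4.25N*log2(N), and split-radix about 4N*log2(N), which is why the mixed radix-2/4 decomposition wins for power-of-two lengths.)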

Smile
Post 07 Nov 2009, 02:24