FindSabrina.org provides this page as a resource to the AltiVec community. Sabrina's father, Greg Allen, is an electrical engineer and a long-time AltiVec enthusiast.


AltiVec is a set of instructions for the PowerPC processor which enhance the performance of signal and image processing using a SIMD (single-instruction multiple-data) approach. AltiVec is a trademark of Motorola, and was developed for their MPC74XX family of processors. Apple Computer calls this the PowerPC G4 with Velocity Engine. AltiVec is also present in the PowerPC 970 from IBM.


AltiVec FFT Implementations

The Fast Fourier Transform (FFT) is one of the cornerstone algorithms of signal processing, and a fast implementation is frequently desired. This page provides several different implementations in AltiVec. However, the FFTW project now provides support for SIMD/AltiVec. Based on my benchmarks below, fftw-3.0-fma is the fastest open-source FFT implementation available. These implementations have been tested on MacOS X v10.2 using the built-in compiler, and on Yellow Dog Linux 2.3 using a snapshot of GCC 3.3. They also compile under the old patched gcc-2.95 compiler that was formerly at AltiVec.org. Full source code is provided, and you are welcome to try to improve on them. All of these implementations use interleaved real and imaginary data, and support only radix-2. If you have other (especially non-interleaved) versions, please send them for inclusion.

  • vBigDSP - This is the fastest implementation provided here. It is based on source code released by Apple. It has been modified to reduce dependency on the classic MacOS, and so that it runs on Linux. It comes with a self-test program. Unfortunately, it is not reentrant or thread safe.
  • moto_fft - This is based on the implementation in Motorola's Application Note AN2115/D. I have added functions to initialize the weight table, and to descramble the bit-flipped output. I have also changed the API to eliminate the need for log2(N). Unfortunately, the inverse fft is not working properly.
  • cfft2 - That's "Complex FFT, radix 2". This comes from its Fortran roots, and was provided to me by someone on the AltiVec mailing list. It is not fast, but seems to work.


AltiVec FFT Performance

These benchmarks were run on an Apple Power Macintosh G4 with dual 800 MHz PowerPCs. [Only one of the PowerPCs is actually used in the benchmark.] All but one were run on Yellow Dog Linux 2.3 with the stock SMP kernel (2.4.19-4asmp) and gcc-3.3-20030303. The one maked "Jaguar" was run on MacOS X 10.2.4.

vBigDSP is the fastest of these FFTs over most of the benchmarks. It is run with two different compilers and operating systems to illustrate that this can make a significant performance difference. Although Jaguar appears to be generally faster for vBigDSP, this was not the case for all of the FFT implementations.

moto_fft does well for very small FFTs, but gives a bit-reversed output. When the output is descrambled (using my admittedly unoptimized function), the performance is poor.

cfft2 is very slow, and included only for another data point. We also included FFTW 2.1.5 (which does not use AltiVec) for comparison with the others. Please note that fftw-3.0-fma is faster than any of the codes provided here.

[Note: vBigDSP does not compile on CodeWarrior because it lacks valloc(). You can substitue malloc(), which gives vector-aligned data on MacOS X. The AltiVec PIM says we should use vec_malloc(), but that is missing from both platforms!]

All benchmarks are on one-dimensional complex radix-2 FFTs with interleaved data. FLOPS are computed as 5 N log2 N, divided by execution time. The execution time is the mean for a single transform, computed over 0.25 second. MFLOPS are 10^6 FLOPS. Here is the code and the Matlab script that generated this graph.


Update 7/2/03 - fftw-3.0 was released in April 2003, and officially supports SIMD/AltiVec. It has a special "fused multiply-add" version, fftw-3.0-fma, for processors (such as PowerPC) with these instructions.

This benchmark is run on the same computers as above, but software has been upgraded. This was benchmarked on MacOS X v10.2.6 using the built-in compiler, and on Yellow Dog Linux 3.0 using the release version of GCC 3.3.

vBigDSP performs roughly the same as before, despite the upgrades. vBigDSP on Jaguar was previously the best-performing FFT, but fftw-3.0-fma is now clearly faster on both platforms. (I knew those guys would win eventually.)

Here is the code and the Matlab script that generated this graph.

It is important to point out that all of these implementations use interleaved complex data. This data is organized in memory as: real | imag | real | imag, as opposed to split complex data which has separate arrays for the real and the imaginary data.

MacOS X has a library called vDSP which uses split complex data and is nearly twice as fast in some instances. Source code is not available, so it doesn't exist on Linux. If you have an open-source split-complex AltiVec FFT, please send it for comparison!


Update 11/13/03 - The authors of FFTW have recently completed a full series of BenchFFT benchmarks on my dual-processor 2.0 GHz Power Mac G5. [Only one processor is actually used in the benchmark.] These are the 1-D complex radix-2 AltiVec results, and their full benchmark series is here. We used MacOS X 10.3 Panther and Xcode 1.0.1, which provides gcc-3.3. Clearly, you should be using FFTW in favor of the vBigDSP library provided here.

Apple's vDSP library performs very well over many of the sizes shown here. In several cases, it performs at better than 50% of the peak for AltiVec! The 1024-length in-place case has a performance anomaly about 10% of the time (which is exaggerated in this plot). Most of the time, this case performs at about 9.2 GFLOPS -- more in line with the others. Unfortunately, this library is only available on MacOS, only does radix-2 FFTs, and is closed-source.

The authors of FFTW have also run these benchmarks on the same machine using IBM's beta compilers for MacOSX, XL C/C++ V6.0 Beta and XL Fortran V8.1 Beta. FFTW is considerably slower using XL C.

These benchmarks have not yet been run under Linux. I'll run my benchmarks again shortly when Linux is better supported on the G5.


AltiVec and/or FFT Links


For more information contact: Greg Allen <gallen@arlut.utexas.edu>