FindSabrina.org provides this page as a resource to the AltiVec community. Sabrina's father, Greg Allen, is an electrical engineer and a long-time AltiVec enthusiast.
|
|
AltiVec is a set of instructions for the PowerPC processor which enhance the performance of signal and image processing using a SIMD (single-instruction multiple-data) approach. AltiVec is a trademark of Motorola, and was developed for their MPC74XX family of processors. Apple Computer calls this the PowerPC G4 with Velocity Engine. AltiVec is also present in the PowerPC 970 from IBM. |
The Fast Fourier Transform (FFT) is one of the cornerstone algorithms of signal processing, and a fast implementation is frequently desired. This page provides several different implementations in AltiVec. However, the FFTW project now provides support for SIMD/AltiVec. Based on my benchmarks below, fftw-3.0-fma is the fastest open-source FFT implementation available. These implementations have been tested on MacOS X v10.2 using the built-in compiler, and on Yellow Dog Linux 2.3 using a snapshot of GCC 3.3. They also compile under the old patched gcc-2.95 compiler that was formerly at AltiVec.org. Full source code is provided, and you are welcome to try to improve on them. All of these implementations use interleaved real and imaginary data, and support only radix-2. If you have other (especially non-interleaved) versions, please send them for inclusion.
|
These vBigDSP is the fastest of these FFTs over most of the benchmarks. It is run with two different compilers and operating systems to illustrate that this can make a significant performance difference. Although Jaguar appears to be generally faster for vBigDSP, this was not the case for all of the FFT implementations. moto_fft does well for very small FFTs, but gives a bit-reversed output. When the output is descrambled (using my admittedly unoptimized function), the performance is poor. cfft2 is very slow, and included only for another data point. We also included FFTW 2.1.5 (which does not use AltiVec) for comparison with the others. Please note that fftw-3.0-fma is faster than any of the codes provided here. [Note: vBigDSP does not compile on CodeWarrior because it lacks valloc(). You can substitue malloc(), which gives vector-aligned data on MacOS X. The AltiVec PIM says we should use vec_malloc(), but that is missing from both platforms!] All benchmarks are on one-dimensional complex radix-2 FFTs with interleaved data. FLOPS are computed as 5 N log2 N, divided by execution time. The execution time is the mean for a single transform, computed over 0.25 second. MFLOPS are 10^6 FLOPS. Here is the code and the Matlab script that generated this graph. |
|
Update 7/2/03 This benchmark is run on the same computers as above, but software has been upgraded. This was benchmarked on MacOS X v10.2.6 using the built-in compiler, and on Yellow Dog Linux 3.0 using the release version of GCC 3.3. vBigDSP performs roughly the same as before, despite the upgrades. vBigDSP on Jaguar was previously the best-performing FFT, but fftw-3.0-fma is now clearly faster on both platforms. (I knew those guys would win eventually.) Here is the code and the Matlab script that generated this graph. It is important to point out that all of these implementations use interleaved complex data. This data is organized in memory as: real | imag | real | imag, as opposed to split complex data which has separate arrays for the real and the imaginary data. MacOS X has a library called vDSP which uses split complex data and is nearly twice as fast in some instances. Source code is not available, so it doesn't exist on Linux. If you have an open-source split-complex AltiVec FFT, please send it for comparison! |
|
Update 11/13/03 Apple's vDSP library performs very well over many of the sizes shown here. In several cases, it performs at better than 50% of the peak for AltiVec! The 1024-length in-place case has a performance anomaly about 10% of the time (which is exaggerated in this plot). Most of the time, this case performs at about 9.2 GFLOPS -- more in line with the others. Unfortunately, this library is only available on MacOS, only does radix-2 FFTs, and is closed-source. The authors of FFTW have also run these benchmarks on the same machine using IBM's beta compilers for MacOSX, XL C/C++ V6.0 Beta and XL Fortran V8.1 Beta. FFTW is considerably slower using XL C. These benchmarks have not yet been run under Linux. I'll run my benchmarks again shortly when Linux is better supported on the G5. |
|
|
For more information contact: Greg Allen <gallen@arlut.utexas.edu>
![]()