The following article is a work in progress, and will be continually updated over time. I'll issue a RELEASE tag or similar when everything has been finished.
Q. How do available FFT routines compare over different libraries and accelerator hardwares?
This is something I have wondered for sometime, though it can be difficult for an apples-to-apples comparison. I will consider the above question in relation to the following plan:
Comparing FFT implementations in a pseudospectral solver for linear and nonlinear Schrodinger equation dynamics
To decide on how to proceed, it is instructive to list all (or most) widely known and used implementations.
Standard implementations
[CPU]
- FFTW: http://fftw.org/ and https://github.com/FFTW/fftw3
- MKL-FFT: https://software.intel.com/en-us/articles/the-intel-math-kernel-library-and-its-fast-fourier-transform-routines
- clFFT: https://github.com/clMathLibraries/clFFT
- Eigen FFT (using kissfft backend): http://eigen.tuxfamily.org/index.php?title=EigenFFT
- sFFT: http://groups.csail.mit.edu/netmit/sFFT/code.html
- KFR FFT: https://github.com/kfrlib/fft
[GPU]
- cuFFT: http://docs.nvidia.com/cuda/cufft/index.html
- clFFT: https://github.com/clMathLibraries/clFFT
- fbfft: https://github.com/facebook/fbcuda/tree/master/fbfft and https://arxiv.org/abs/1412.7580
[Cutting edge]
- FPGA FFT (Xilinx and Altera/Intel): https://www.xilinx.com/products/intellectual-property/fft.html and https://www.altera.com/products/intellectual-property/ip/dsp/m-ham-fft.html
- IBM TrueNorth FFT: No link/source found (yet)
- Rigetti pyQuil quantum FFT: http://pyquil.readthedocs.io/en/latest/getting_started.html (currently emulated AFAIK, but will eventually/hopefully run directly on quantum hardware)
Test cases
To ensure sufficient time is spent in the FFT routines I will opt for a 2D transform of an NxN grid, requiring N transforms, a transpose, and another N. The value of N can be scaled to determine sweet spots for all implementations. Granted, memory size will be the determining factor of how large we can go. N=2j where j∈Z,7≤j≤11 is a reasonable range of values to examine that should fit readily on most hardware.
For this, the following test precisions will be instructive:
C2C - 32-bit float (32-bits for each real and imaginary component)
C2C - 64-bit float
R2C - 32-bit float
R2C - 64-bit float
As a sample system to investigate, I will consider the case of a quantum harmonic oscillator (QHO). To investigate dynamics and their resulting accuracy, a known and analytically calculable result is useful. For this, I will opt for a a superposition state of the groundstate along the xy plane, and the first excited state along x with the groundstate along y. To put things more formally:
Ψ(x,y)=Ψ00(x,y)+Ψ10(x,y)√2. To construct the above wavefunction we will assume that the states Ψ00(x,y) and Ψ10(x,y) are outer products of the solutions to the QHO, as given by:
Ψn(x)=1√2nn!(mωxπℏ)14e−mωxx22ℏHn(x), where Hn(x)=(−1)nex2dndxn(e−x2), are the physicists’ Hermite polynomials. As we are looking for the ground and first excited states (n=0,1) we can simplify a lot of the above calculations: H0(x)=1 and H0(x)=2x. Next, we determine the states Ψ0(x) and Ψ1(x) as:
Ψ0(x)=(mωxπℏ)14e−mωxx22ℏ, Ψ1(x)=2x√2(mωxπℏ)14e−mωxx22ℏ.
Along a single dimension, our system can potentially be in any of the allowed harmonic oscillator states. Therefore, the overall state is given by a tensor product of the state along x and that along y. If we assume the dimensionality of the Hilbert space along the i-th orthogonal spacial index is finite and given by di, then the overall system dimensionality (ie the total number of states we can consider) is d=∏idi.
Assuming an n-dimensional system, where each individual dimension is in the groundstate of the harmonic oscillator, we can define the tensor product state as Ψn=Ψ⊗n0=Ψ0⊗Ψ0⋯⊗Ψ0.
While the above is fine for a general case, for the purposes of FFTs we have assumed a much simpler system, dealing with combinations of only 2 states. We can then define them as Ψ00(x,y)=Ψ0(x)⊗Ψ0(y) and Ψ10(x,y)=Ψ1(x)⊗Ψ0(y). Thus, our final superposition state when filling everything in is given by Ψ(x,y)=(mπℏ)12(ωxωy)14e−m2ℏ(ωxx2+ωyy2)(x+1√2).
To simplify life, I will assume some of the above quantities are set to unity (i.e. m=ωx=ωy=ℏ=1). This will not change the dynamics of the simulation, but simplify the equations and the numerical implementation, which can help with improving accuracy. The above equation then becomes Ψ(x,y)=(1π)12e−12(x2+y2)(x+1√2).
For a more detailed explanation and approach to the above have a look at Christina Lee’s (albi3ro) blog post on time evolution, which will cover the method I use next.
To be continued
/***
This document will implement the quantum simulation aspects of the code, in as simple a means as possible.
***/
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define GRIDSIZE 128
#define GRIDMAX 5
#define PI 3.14159
#define DT 1e-4
//################################################//
struct double2{
double x;
double y;
};
//Define the grid on which the simulation will be run.
struct Grid{
double *x = (double*) malloc(GRIDSIZE*sizeof(double));
double *y = (double*) malloc(GRIDSIZE*sizeof(double));
double *xy2 = (double*) malloc(GRIDSIZE*GRIDSIZE*sizeof(double));
double *kx = (double*) malloc(GRIDSIZE*sizeof(double));
double *ky = (double*) malloc(GRIDSIZE*sizeof(double));
double *kxy2 = (double*) malloc(GRIDSIZE*GRIDSIZE*sizeof(double));
double2 *wfc = (double2*) malloc(GRIDSIZE*GRIDSIZE*sizeof(double2));
double2 *U_V = (double2*) malloc(GRIDSIZE*GRIDSIZE*sizeof(double2));
double2 *U_K = (double2*) malloc(GRIDSIZE*GRIDSIZE*sizeof(double2));
};
void setupGrids(Grid *grid){
double invSqrt2 = 1/sqrt(2);
//Position space grids
for (int j=0; j < GRIDSIZE; ++j){
grid->x[j] = -GRIDMAX + j*(2*GRIDMAX)/((double)GRIDSIZE);
grid->y[j] = grid->x[j];
}
for(int i=0; i < GRIDSIZE; ++i){
for(int j=0; j < GRIDSIZE; ++j){
grid->xy2[j+GRIDSIZE*i] = 0.5*(grid->x[i]*grid->x[i]+grid->y[j]*grid->y[j]);
//Evolution operator in position space
grid->U_V[j+GRIDSIZE*i].x = cos(-grid->xy2[j+GRIDSIZE*i]*DT);
grid->U_V[j+GRIDSIZE*i].y = sin(-grid->xy2[j+GRIDSIZE*i]*DT);
//Wavefunction
grid->wfc[j+GRIDSIZE*i].x = sqrt( 1/PI )*exp( -0.5 * ( grid->x[i]*grid->x[i] + grid->y[j]*grid->y[j] ) ) * ( grid->x[i] + invSqrt2 );
grid->wfc[j+GRIDSIZE*i].y = 0.;
}
}
//Momentum space grids
for (int j=0; j < GRIDSIZE; ++j){
grid->kx[j] = (j<(GRIDSIZE/2)) ? j*2*PI/(GRIDMAX) : -(GRIDSIZE - j)*(2*PI/(GRIDMAX));
grid->ky[j] = grid->kx[j];
}
for(int i=0; i < GRIDSIZE; ++i){
for(int j=0; j < GRIDSIZE; ++j){
grid->kxy2[j+GRIDSIZE*i] = 0.5*(grid->kx[i]*grid->kx[i]+grid->ky[j]*grid->ky[j]);
//Evolution operator in momentum space
grid->U_K[j+GRIDSIZE*i].x = cos(-grid->kxy2[j+GRIDSIZE*i]*DT);
grid->U_K[j+GRIDSIZE*i].y = sin(-grid->kxy2[j+GRIDSIZE*i]*DT);
}
}
}
//################################################//
int fileIO(Grid *grid){
//Open files to write
FILE *f_x = fopen("x", "w");
FILE *f_y = fopen("y", "w");
FILE *f_xy2 = fopen("xy2", "w");
FILE *f_kx = fopen("kx", "w");
FILE *f_ky = fopen("ky", "w");
FILE *f_kxy2 = fopen("kxy2", "w");
FILE *f_U_V = fopen("U_V", "w");
FILE *f_U_K = fopen("U_K", "w");
FILE *f_wfc = fopen("wfc", "w");
//No safety checks because I'm a bad person
fwrite(grid->x, sizeof(double), GRIDSIZE, f_x);
fwrite(grid->y, sizeof(double), GRIDSIZE, f_y);
fwrite(grid->xy2, sizeof(double), GRIDSIZE*GRIDSIZE, f_xy2);
fwrite(grid->kx, sizeof(double), GRIDSIZE, f_kx);
fwrite(grid->ky, sizeof(double), GRIDSIZE, f_ky);
fwrite(grid->kxy2, sizeof(double), GRIDSIZE*GRIDSIZE, f_kxy2);
fwrite(grid->U_V, sizeof(double2), GRIDSIZE*GRIDSIZE, f_U_V);
fwrite(grid->U_K, sizeof(double2), GRIDSIZE*GRIDSIZE, f_U_K);
fwrite(grid->wfc, sizeof(double2), GRIDSIZE*GRIDSIZE, f_wfc);
//No moar opin
fclose(f_x); fclose(f_y); fclose(f_xy2); fclose(f_kx); fclose(f_ky);
fclose(f_kxy2); fclose(f_U_V); fclose(f_U_K); fclose(f_wfc);
return 0;
}
//################################################//
int main(){
Grid grid;
setupGrids(&grid);
fileIO(&grid);
}
//################################################//