CUDA Support Preview

Within the next couple of months we’ll be releasing an update to our Numerical Libraries for .NET that includes support for GPU-accelerated calculations using NVIDIA’s CUDA libraries.

We wanted to make it very easy to offload calculations to the CPU. This means the number of code changes should be minimized. At the same time, we wanted to make optimal use of the GPU. One of the most time-consuming parts of GPU computing is transferring data between CPU and GPU memory. In order to take full advantage of the processing power, data should be kept on the GPU as much as possible.

To support this functionality, we’ve introduced three new types: DistributedProvider, DistributedVector and DistributedMatrix. DistributedProvider is an abstract class that defines core functionality of the distributed computing platform such as memory management and computational kernels. At the moment, it has one concrete implementation: CudaProvider. We expect to add OpenCLProvider and possibly others (Xeon Phi, C++/AMP, MPI…) in the future.

DistributedVector and DistributedMatrix represent distributed (i.e. GPU-based) arrays that may or may not have a local copy in CPU memory. Data is copied to and from GPU memory only as needed. The result of, say, multiplying two matrices in GPU memory is a third matrix which also resides on the GPU. If further calculations are done using this matrix, its data is kept on the GPU. It is never copied to CPU memory, unless individual elements are accessed, or an operation is performed that is not supported on the GPU. In the latter case, the data is copied to local memory and the CPU-based implementation is used. We also fall back on the CPU-based code automatically if the GPU runs out of memory, or if there is no CUDA GPU at all.

So what does the code look like? Well, here is a small example that runs 100 iterations of the power method for computing the largest eigenvalue of a matrix using various configurations. The core computation is a matrix-vector product. We use the same code for the CPU-based and the GPU-based calculation.

static double DoPower(Matrix<double> A, Vector<double> b) { 
    double λ = 0;
    Vector<double> temp = null) {
    for (int i = 0; i < imax; i++) {
        temp = Matrix<double>.MultiplyInto(A, b, temp);
        // Note that temp will exist only on the GPU
        // if A and b are GPU arrays
        var λ1 = temp.Norm();
        Vector<double>.MultiplyInto(1 / λ1, temp, b);
        if (Math.Abs(λ1 - λ) < 1e-5) break;
        λ = λ1;
    // In case this is a GPU array: free GPU memory.
    return λ;

// CPU:
l = DoPower(A, b);

// GPU:
l = DoPower(A.MakeDistributed(), b.MakeDistributed());

As you can see, the only difference between the CPU and GPU versions is that we called MakeDistributed on the input arguments.

In our benchmark we added a third variation that touches the matrix on the CPU during each iteration. This forces the matrix to be copied to the GPU in each iteration, which is similar to what happens with naive offloading. Here are the results:

Size: 100
MKL (CPU only):        4.107 ms (lambda=6.07967352075151)
CUDA (keep on GPU):   25.101 ms (lambda=6.07967352075151)
CUDA (offloaded):     29.593 ms (lambda=6.07967352075151)

Size: 500
MKL (CPU only):       42.116 ms (lambda=13.3132987677261)
CUDA (keep on GPU):   30.376 ms (lambda=13.3132987677261)
CUDA (offloaded):     94.250 ms (lambda=13.3132987677261)

Size: 1000
MKL (CPU only):      171.170 ms (lambda=18.754878830699)
CUDA (keep on GPU):   35.196 ms (lambda=18.754878830699)
CUDA (offloaded):    276.329 ms (lambda=18.754878830699)

Size: 5000
MKL (CPU only):    4397.868 ms (lambda=41.3752599052634)
CUDA (keep on GPU): 282.907 ms (lambda=41.3752599052635)
CUDA (offloaded):  5962.417 ms (lambda=41.3752599052635)

This is on a GPGPU-poor GTX 680, with 1/24 double-precision capacity, compared to 1/2 for Titan/Kepler cards, and using Intel’s Math Kernel Library version 11. It clearly shows that using our implementation, the GPU version is competitive for moderate sized matrices (n=500) and really charges ahead for larger problems, while the simple offloading technique never quite catches up.