CUDA Support Preview

Within the next couple of months we’ll be releasing an update to our Numerical Libraries for .NET that includes support for GPU-accelerated calculations using NVIDIA’s CUDA libraries.

We wanted to make it very easy to offload calculations to the GPU. This means keeping the number of code changes to a minimum. At the same time, we wanted to make optimal use of the GPU. One of the most time-consuming parts of GPU computing is transferring data between CPU and GPU memory. To take full advantage of the GPU’s processing power, data should be kept on the GPU as much as possible.

To support this functionality, we’ve introduced three new types: DistributedProvider, DistributedVector and DistributedMatrix. DistributedProvider is an abstract class that defines core functionality of the distributed computing platform such as memory management and computational kernels. At the moment, it has one concrete implementation: CudaProvider. We expect to add OpenCLProvider and possibly others (Xeon Phi, C++/AMP, MPI…) in the future.

DistributedVector and DistributedMatrix represent distributed (i.e. GPU-based) arrays that may or may not have a local copy in CPU memory. Data is copied to and from GPU memory only as needed. The result of, say, multiplying two matrices in GPU memory is a third matrix which also resides on the GPU. If further calculations are done using this matrix, its data is kept on the GPU. It is never copied to CPU memory, unless individual elements are accessed, or an operation is performed that is not supported on the GPU. In the latter case, the data is copied to local memory and the CPU-based implementation is used. We also fall back on the CPU-based code automatically if the GPU runs out of memory, or if there is no CUDA GPU at all.
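To make the copy-only-when-needed idea concrete, here is a minimal sketch in plain Python. It is a toy model, not our implementation: the class LazyDeviceArray, its methods and its transfer counters are all hypothetical, and the "device" is just a second Python list.

```python
class LazyDeviceArray:
    """Toy model of an array that lives on a (simulated) device.

    Hypothetical class for illustration only; it is not the library's API.
    """

    def __init__(self, host_data):
        self._host = list(host_data)   # local (CPU) copy
        self._device = None            # simulated GPU copy
        self.uploads = 0               # host -> device transfers
        self.downloads = 0             # device -> host transfers

    @classmethod
    def _on_device(cls, device_data):
        # Result of a device-side operation: no host copy exists yet.
        arr = cls([])
        arr._host = None
        arr._device = list(device_data)
        return arr

    def _ensure_device(self):
        # Upload only if there is no device copy yet.
        if self._device is None:
            self._device = list(self._host)
            self.uploads += 1
        return self._device

    def _ensure_host(self):
        # Download only if there is no host copy yet.
        if self._host is None:
            self._host = list(self._device)
            self.downloads += 1
        return self._host

    def add(self, other):
        # "Device-side" operation: each operand is uploaded at most once,
        # and the result stays on the device.
        x = self._ensure_device()
        y = other._ensure_device()
        return LazyDeviceArray._on_device(p + q for p, q in zip(x, y))

    def __getitem__(self, i):
        # Element access needs a host copy, so it triggers a download.
        return self._ensure_host()[i]


a = LazyDeviceArray([1.0, 2.0, 3.0])
b = LazyDeviceArray([10.0, 20.0, 30.0])
c = a.add(b).add(a)      # chained work: nothing is downloaded yet
first = c[0]             # the first element access copies the result back
```

In this model each input is uploaded once, the intermediate sum never leaves the simulated device, and only the final element access causes a download.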

So what does the code look like? Well, here is a small example that runs 100 iterations of the power method for computing the largest eigenvalue of a matrix using various configurations. The core computation is a matrix-vector product. We use the same code for the CPU-based and the GPU-based calculation.

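static double DoPower(Matrix<double> A, Vector<double> b) {
    // Maximum number of iterations; the benchmark runs 100.
    const int imax = 100;
    double λ = 0;
    Vector<double> temp = null;
    try {
        for (int i = 0; i < imax; i++) {
            // Matrix-vector product; temp is reused across iterations.
            temp = Matrix<double>.MultiplyInto(A, b, temp);
            // Note that temp will exist only on the GPU
            // if A and b are GPU arrays
            var λ1 = temp.Norm();
            // Normalize: b = temp / λ1
            Vector<double>.MultiplyInto(1 / λ1, temp, b);
            if (Math.Abs(λ1 - λ) < 1e-5) break;
            λ = λ1;
        }
    } finally {
        // In case this is a GPU array: free GPU memory.
        // (Assumes the vector type is disposable.)
        if (temp != null) temp.Dispose();
    }
    return λ;
}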

// CPU:
l = DoPower(A, b);

// GPU:
l = DoPower(A.MakeDistributed(), b.MakeDistributed());

As you can see, the only difference between the CPU and GPU versions is that we called MakeDistributed on the input arguments.
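For readers who want to see the algorithm without the library, here is a rough pure-Python sketch of the same power iteration. The helper names mat_vec and power_method are ours for this illustration, and the sketch assumes a matrix whose dominant eigenvalue is positive and a starting vector that is not mapped to zero.

```python
import math


def mat_vec(A, x):
    # Plain matrix-vector product on nested lists.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def power_method(A, b, max_iter=100, tol=1e-5):
    # Same steps as DoPower: multiply, take the norm as the eigenvalue
    # estimate, normalize, and stop when the estimate settles.
    lam = 0.0
    for _ in range(max_iter):
        temp = mat_vec(A, b)
        lam1 = math.sqrt(sum(t * t for t in temp))
        b = [t / lam1 for t in temp]
        if abs(lam1 - lam) < tol:
            break
        lam = lam1
    return lam


A = [[2.0, 1.0], [1.0, 3.0]]
lam = power_method(A, [1.0, 1.0])
# The largest eigenvalue of this matrix is (5 + sqrt(5)) / 2.
exact = (5 + math.sqrt(5)) / 2
```

The estimate agrees with the exact value to roughly the tolerance used in the stopping test.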

In our benchmark we added a third variation that touches the matrix on the CPU during each iteration. This forces the matrix to be copied to the GPU in each iteration, which is similar to what happens with naive offloading. Here are the results:

Size: 100
MKL (CPU only):        4.107 ms (lambda=6.07967352075151)
CUDA (keep on GPU):   25.101 ms (lambda=6.07967352075151)
CUDA (offloaded):     29.593 ms (lambda=6.07967352075151)

Size: 500
MKL (CPU only):       42.116 ms (lambda=13.3132987677261)
CUDA (keep on GPU):   30.376 ms (lambda=13.3132987677261)
CUDA (offloaded):     94.250 ms (lambda=13.3132987677261)

Size: 1000
MKL (CPU only):      171.170 ms (lambda=18.754878830699)
CUDA (keep on GPU):   35.196 ms (lambda=18.754878830699)
CUDA (offloaded):    276.329 ms (lambda=18.754878830699)

Size: 5000
MKL (CPU only):    4397.868 ms (lambda=41.3752599052634)
CUDA (keep on GPU): 282.907 ms (lambda=41.3752599052635)
CUDA (offloaded):  5962.417 ms (lambda=41.3752599052635)

This is on a GTX 680, a card that is weak at GPGPU work: its double-precision throughput is 1/24 of its single-precision throughput, compared to 1/3 for GK110-based cards such as the GTX Titan. The CPU runs use Intel’s Math Kernel Library version 11. The results show that with our implementation the GPU version is competitive for moderately sized matrices (n=500) and pulls well ahead for larger problems, while the simple offloading technique never catches up with the CPU.
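To put numbers on that, the short script below computes the speedup relative to the CPU-only run for each size. The timings are copied from the benchmark output above; the dictionary layout and key names are just for this illustration.

```python
# Timings in milliseconds, copied from the benchmark output above.
timings = {
    100:  {"cpu": 4.107,    "gpu": 25.101,  "offload": 29.593},
    500:  {"cpu": 42.116,   "gpu": 30.376,  "offload": 94.250},
    1000: {"cpu": 171.170,  "gpu": 35.196,  "offload": 276.329},
    5000: {"cpu": 4397.868, "gpu": 282.907, "offload": 5962.417},
}

speedups = {}
for n, t in sorted(timings.items()):
    # A value above 1 means the GPU variant is faster than the CPU.
    speedups[n] = (t["cpu"] / t["gpu"], t["cpu"] / t["offload"])
    print("n=%5d  keep-on-GPU: %5.2fx  offloaded: %5.2fx" % ((n,) + speedups[n]))
```

Keeping data on the GPU is slower than the CPU at n=100, about 1.4 times faster at n=500, and about 15.5 times faster at n=5000. The offloaded variant stays below 1 at every size.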

Using Numerical Libraries from IronPython

Today, we have the pleasure of announcing the availability of a new IronPython interface library and over 50 samples that show how to use our Extreme Optimization Numerical Libraries for .NET from IronPython.

Python is used more and more for numerical computing. It has always been possible to call into our libraries from IronPython. However, IDE support was minimal and some of the more convenient features of Python, like slicing arrays, were not available.

In January, Microsoft announced the availability of Python Tools for Visual Studio. This is a big step forward in IDEs for Python development on Windows.

Now, with our new IronPython interface library you can take advantage of the following integration features:

  • Create vectors and matrices from Python lists.
  • Set and get slices of vectors and matrices.
  • Integrate Python’s complex number type with our DoubleComplex type.
  • Use Python-style format specifiers.

If you want to dive right in, the download is here: IronPython Tools for Extreme Optimization Numerical Libraries for .NET.


In order to use the IronPython interface library, you need the following:


To install the IronPython interface library, follow these steps:

  1. Make sure all the prerequisites are installed.
  2. Download the zip archive containing the IronPython interface library for the Extreme Optimization Numerical Libraries for .NET.
  3. Copy the Extreme.Numerics.IronPython27.dll file from the zip archive to the DLLs folder in the IronPython installation folder.
  4. Copy the IronPython folder from the zip archive to the QuickStart folder in the Extreme Optimization Numerical Libraries for .NET installation folder.

Getting Started

To use the interface library, import the numerics module:

IronPython 2.7.1 ( on .NET 4.0.30319.239 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import numerics 

The main types reside in the Extreme.Mathematics namespace, so it’s a good idea to import everything from it:

>>> from Extreme.Mathematics import *

You can then start using mathematical objects like vectors, and manipulate them:

>>> from math import *
>>> a = Vector([1,2,3,4,5])
>>> b = Vector.Create(5, lambda i: sqrt(i+1))
>>> b
>>> a+b
>>> a.Apply(sin)

You can use Python slicing syntax, including counting from the end:

>>> a[0] 
>>> a[-2] 
>>> a[-2:] 
>>> a[1:4] 

Slicing works on matrices as well:

>>> H = Matrix.Create(5,5, lambda i,j: 1.0 / (1+i+j)) 
>>> H 
>>> H[1,1] 
>>> H[1,:] 
>>> H[:,1] 
>>> H[0:5:2,0:5:2] 
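If you are curious how this syntax reaches a library, the sketch below shows the Python mechanism involved. Python passes H[1,:] to __getitem__ as the tuple (1, slice(None)), and slice.indices resolves the start, stop and step against the dimension. The Grid class is a hypothetical stand-in built on nested lists; it is not how our Matrix type is implemented.

```python
class Grid:
    """Hypothetical row-major matrix used only to illustrate indexing."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    @staticmethod
    def _positions(key, length):
        # Turn an int or a slice into a list of concrete positions.
        if isinstance(key, slice):
            return list(range(*key.indices(length)))
        if key < 0:
            key += length          # count from the end
        return [key]

    def __getitem__(self, key):
        row_key, col_key = key     # H[i, j] arrives as the tuple (i, j)
        n_rows, n_cols = len(self.rows), len(self.rows[0])
        r = self._positions(row_key, n_rows)
        c = self._positions(col_key, n_cols)
        picked = [[self.rows[i][j] for j in c] for i in r]
        row_is_slice = isinstance(row_key, slice)
        col_is_slice = isinstance(col_key, slice)
        if not row_is_slice and not col_is_slice:
            return picked[0][0]                 # single element
        if not row_is_slice:
            return picked[0]                    # one row
        if not col_is_slice:
            return [row[0] for row in picked]   # one column
        return picked                           # sub-matrix


H = Grid([[1.0 / (1 + i + j) for j in range(5)] for i in range(5)])
element = H[1, 1]
row = H[1, :]
column = H[:, 1]
corners = H[0:5:2, 0:5:2]   # rows and columns 0, 2 and 4
```

The same four index expressions as in the session above return a single element, a row, a column and a 3x3 sub-matrix.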

Many linear algebra operations are supported, from the simple to the more complex:

>>> H*a 
>>> H.Solve(a) 
>>> svd = H.GetSingularValueDecomposition() 
>>> svd.SingularValues 
Vector([1.5670506910982314, 0.20853421861101323, 0.011407491623419797, 0.00030589804015118552, 3.2879287721734089E-06])
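The matrix H defined above is the 5x5 Hilbert matrix, which is known for being badly conditioned. As a quick check, the 2-norm condition number is the ratio of the largest to the smallest singular value. The snippet below computes it in plain Python from the values printed above; the variable names are ours.

```python
# Singular values copied from the session output above.
singular_values = [1.5670506910982314, 0.20853421861101323,
                   0.011407491623419797, 0.00030589804015118552,
                   3.2879287721734089E-06]

# 2-norm condition number: largest singular value over the smallest.
condition_number = max(singular_values) / min(singular_values)
print("condition number: %.4g" % condition_number)
```

The ratio is a little under half a million, so solving a system with H can lose roughly five to six significant digits.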

Sample programs

We’ve converted over 60 of our QuickStart samples to Python scripts. The samples folder contains a solution with all the sample files. To run an individual sample, find it in Solution Explorer and choose “Set As Startup File” from the context menu. You can then run it in the interactive window by pressing Shift+Alt+F5.