Distributed and GPU Computing

By default, all calculations done by the Extreme Optimization Numerical Libraries for .NET are performed on the CPU. In this section, we describe how calculations can be offloaded to a GPU or a compute cluster. Currently, only CUDA GPUs are supported.

In what follows, the term distributed refers to an object or action on a device, cluster or node other than the main CPU and its associated memory. Local refers to an object or action on the CPU and its associated memory.

All types that implement the distributed computing framework live in the Extreme.Mathematics.Distributed namespace.

Distributed providers

The core object in the distributed computing framework is the distributed provider. It provides core functionality for a specific distributed computing platform, like memory and device management.

Each distributed computing platform has its own provider: a class that inherits from DistributedProvider. The DistributedProvider class has a static Current property that should be set to an instance of the platform's provider.
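For example, to route calculations to a CUDA GPU, the current provider is set to the CUDA provider before any distributed arrays are created. The namespace and member names of the CUDA provider shown below are assumptions for illustration; consult the reference documentation for the exact names in your version of the library.

C#
using Extreme.Mathematics.Distributed;
// Assumed namespace of the CUDA provider (illustrative):
using Extreme.Mathematics.Distributed.Cuda;

// Make the CUDA provider the current distributed provider, so that
// subsequent distributed calculations are performed on the GPU.
// The CudaProvider type and Provider member are assumptions; check the
// reference documentation for the actual provider on your platform.
DistributedProvider.Current = CudaProvider.Provider;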

Distributed arrays

Distributed arrays are vectors and matrices that can be used in distributed computations. There are two types: DistributedVector<T> and DistributedMatrix<T>. They act just like normal vectors and matrices, except that calculations involving these arrays are done on a device or cluster.

Distributed arrays keep their data in device memory. They may also have a local copy in CPU memory. Data is transferred between local and distributed memory only as necessary. The elements of a distributed array can still be accessed from CPU-based code, but this may be expensive if the data has changed: in that case, the entire array must be transferred between distributed and local memory.

Distributed memory is an unmanaged resource. As such, the Dispose method should be called on distributed arrays to make sure that distributed memory is properly released.
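A minimal sketch of this pattern, assuming db is a distributed vector obtained by one of the methods described in the next section:

C#
// db is a distributed vector created earlier. The using statement ensures
// that Dispose is called, releasing the device memory, even if an
// exception is thrown.
using (db)
{
    // ... perform distributed computations with db ...
}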

Creating distributed arrays

Distributed arrays can be created in three ways:

  1. A normal array that lives in CPU memory can be converted to a distributed array.

  2. A new distributed array can be created directly on the device without creating a local copy.

  3. The result of a calculation that involves distributed arrays is a new distributed array that has no local copy.

To convert a local array to a distributed array, use the provider's MakeDistributed method. Alternatively, the MakeDistributed extension method can be called directly on the array. The extension method uses the current distributed provider, so it is unambiguous as long as only one distributed provider is in use.

The data is not copied to the device immediately; the copy happens when it is first needed, or when you explicitly call the Distribute method.
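The sketch below shows both forms. The variable provider stands for whatever DistributedProvider instance is in use, and the Distribute call (shown here on the distributed array) forces the copy to device memory instead of waiting until the data is first needed.

C#
var b = Vector.CreateRandom(1000);

// Using the extension method, which uses the current provider:
var db = b.MakeDistributed();

// Equivalently, through a specific provider instance:
// var db2 = provider.MakeDistributed(b);

// Force the copy to device memory now rather than on first use:
db.Distribute();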

To create a distributed array without creating a local copy, use the provider's CreateVector&lt;T&gt; or CreateMatrix&lt;T&gt; method. The parameters are the desired length of the vector, or the desired number of rows and columns of the matrix, respectively. Distributed memory is allocated immediately; an exception is thrown if the allocation fails.
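For example, using the current provider:

C#
var provider = DistributedProvider.Current;

// Allocate a distributed vector of length 1,000 and a 100-by-200
// distributed matrix directly in device memory, without a local copy:
var dv = provider.CreateVector&lt;double&gt;(1000);
var dM = provider.CreateMatrix&lt;double&gt;(100, 200);

// ... use dv and dM in distributed calculations ...

// Release the device memory when the arrays are no longer needed:
dv.Dispose();
dM.Dispose();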

You can access parts of a distributed array by calling methods like GetSlice or GetRow. No data is transferred during this operation. If the array does not have a local copy, then the sub-array will not have a local copy, either. Note, however, that when data is copied, the entire original array is copied, not just the sub-array.
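For example, taking a row of a distributed matrix (a sketch; GetRow follows the same conventions as for ordinary matrices):

C#
// dM is a distributed matrix. The row is a view that shares dM's
// distributed storage; no data is transferred here:
var row = dM.GetRow(0);

// Reading an element from CPU code may force a copy of the entire
// original matrix, not just this row:
Console.WriteLine(row[0]);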

Operations on distributed arrays

Most matrix and vector operations can be performed directly on distributed arrays. The exact details depend on the provider. If an operation cannot be performed in a distributed way, any distributed data is copied locally, and the calculation is done locally instead.

For binary operations, whenever one of the operands is a distributed array, the entire calculation is attempted on distributed arrays.

In general, it is a good idea to specify the result array in the expression. This can be done by using the Into version of an operation, for example AddInto. This can greatly reduce the number of temporary arrays created during a calculation, and also helps ensure that distributed memory is properly released. Make sure all temporary arrays are disposed when they are no longer needed.
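For example, following the same pattern as MultiplyInto in the example below, an AddInto call that reuses its result array might look like this (a sketch, assuming dx and dy are distributed vectors of the same length):

C#
// Passing null lets the library allocate the result; on subsequent
// calls, the same array can be reused to avoid new temporaries.
Vector&lt;double&gt; sum = null;
sum = Vector.AddInto(dx, dy, sum);

// ... use sum ...

// Dispose of the temporary once it is no longer needed, so that any
// distributed memory it occupies is released:
sum.Dispose();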

Example: The Power Method

Here is an example that shows all the essential elements. It runs up to 1,000 iterations of the power method for computing the largest eigenvalue of a matrix using CUDA, stopping early once the estimate converges. The core computation is a matrix-vector product. The same code can be used for the CPU-based and the GPU-based calculations.

C#
static double DoPower(Matrix<double> A, Vector<double> b)
{
    double λ = 0;
    // Start the iteration from a vector of all ones:
    b.SetValue(1.0);
    Vector<double> temp = null;
    int maxIterations = 1000;
    for (int i = 0; i < maxIterations; i++)
    {
        temp = Matrix<double>.MultiplyInto(A, b, temp);
        // Note that temp will exist only on the GPU
        // if A and b are GPU arrays
        var λ1 = temp.Norm();
        // Normalize and write the result back into b:
        Vector.MultiplyInto(1 / λ1, temp, b);
        // Stop once the eigenvalue estimate has converged:
        bool converged = Math.Abs(λ1 - λ) < 1e-5;
        λ = λ1;
        if (converged) break;
    }
    // In case this is a GPU array: free GPU memory.
    temp.Dispose();
    return λ;
}

public static void Run()
{
    int size = 2000;
    var A = Matrix.CreateRandom(size, size);
    var b = Vector.CreateRandom(size);

    // CPU:
    var l = DoPower(A, b);

    // GPU:
    var dA = A.MakeDistributed();
    var db = b.MakeDistributed();
    l = DoPower(dA, db);
    dA.Dispose();
    db.Dispose();
}

The only difference between the CPU and GPU versions is that MakeDistributed was called on the input arguments, and that the resulting distributed arrays are disposed once they are no longer needed.