By default, all calculations done by the
Numerical Libraries for .NET
are performed by the CPU.
In this section, we describe how calculations can be offloaded to
a GPU or a compute cluster. Currently, only CUDA GPUs are supported.
In what follows, the term distributed refers to
an object or action on a device, cluster or node other than the
main CPU and its associated memory.
Local refers to an object or action on the CPU
and its associated memory.
All types that implement the distributed computing framework
live in the Extreme.Mathematics.Distributed namespace.
The core object in the distributed computing framework is the
distributed provider. It provides core functionality for a specific
distributed computing platform, like memory and device management.
Each distributed computing platform has
its own provider, an object that inherits from a common provider base class.
The base class has a static
property that should be set to an instance of the provider you want to use.
Distributed arrays are vectors and matrices that can be used
in distributed computations. There are two types: distributed vectors and distributed matrices.
They act just like normal vectors and matrices, except that calculations
involving these arrays are done on a device or cluster.
Distributed arrays have their data in device memory.
They may also have a local copy in CPU memory.
Data is transferred between local and distributed memory
only as necessary. The elements of a distributed array
can still be accessed from CPU-based code. However,
this may be expensive if the data has changed.
In that case, the entire array must be transferred
between distributed and local memory.
Distributed memory is an unmanaged resource. As such,
the Dispose method should be called on distributed arrays to make sure
that distributed memory is properly released.
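Because distributed arrays own unmanaged device memory, they can also be wrapped in a using block so that the memory is released deterministically. A minimal sketch, assuming the distributed array types implement IDisposable as the paragraph above implies:

```csharp
var A = Matrix.CreateRandom(1000, 1000);

// The distributed copy owns unmanaged (device) memory,
// so release it deterministically when we are done with it:
using (var dA = A.MakeDistributed())
{
    // ... perform distributed calculations with dA ...
}   // Dispose is called here, releasing the device memory.
```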
Creating distributed arrays
Distributed arrays can be created in three ways:
A normal array that lives in CPU memory
can be converted to a distributed array.
A new distributed array can be created directly on the device
without creating a local copy.
The result of a calculation that involves distributed arrays
is a new distributed array that has no local copy.
To convert a local array to a distributed array, use the corresponding
method on the provider. Alternatively, the MakeDistributed
extension method can be called directly on the array.
This uses the current distributed provider, so it always does the right thing if you have only one distributed provider.
The data isn't copied to the device immediately. It is copied when needed, or when
you explicitly request a transfer.
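For instance, converting an existing local matrix is a single call. A sketch, using the MakeDistributed extension method that also appears in the example later in this section:

```csharp
// Convert a local matrix to a distributed one.
// This uses the current distributed provider; no data
// is copied to the device at this point.
var A = Matrix.CreateRandom(1000, 1000);
var dA = A.MakeDistributed();
```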
To create a distributed array without creating a local copy, use the provider's
array creation methods. The parameters are the desired length of the vector, or the desired number of rows and columns
of the matrix, respectively. Distributed memory is allocated immediately,
and the method throws an exception if the allocation fails.
You can access parts of a distributed array by calling the methods that return sub-arrays.
No data is transferred during this operation.
If the array does not have a local copy, then the sub-array will not have a local copy, either.
Note, however, that when data is copied, the entire
original array is copied, not just the sub-array.
Operations on distributed arrays
Most matrix and vector operations can be performed directly on distributed arrays.
The exact details depend on the provider. If an operation cannot be performed
in a distributed way, any distributed data is copied locally, and the calculation is
done locally instead.
For binary operations, whenever one of the operands is a distributed array,
the entire calculation is attempted on distributed arrays.
In general, it is a good idea to specify the result array in the expression.
This can be done by using the Into version of an operation, such as MultiplyInto.
This can greatly reduce the number of temporary arrays created during a calculation,
and it also helps ensure that distributed memory is properly released.
Make sure all temporary arrays are disposed when they are no longer needed.
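As a sketch of this pattern, a repeated matrix-vector product can reuse a single temporary and dispose of it afterwards (using the MultiplyInto methods that also appear in the example below; dA and db are assumed to be distributed arrays created earlier):

```csharp
Vector<double> temp = null;
for (int i = 0; i < 10; i++)
{
    // Reuse temp as the result array; on the first pass
    // (temp == null) a new distributed array is created.
    temp = Matrix<double>.MultiplyInto(dA, db, temp);
}
// Release the distributed memory held by the temporary:
temp.Dispose();
```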
Example: The Power Method
Here is an example that shows all the essential elements.
It runs up to 1,000 iterations of the power method for computing
the largest eigenvalue of a matrix using CUDA,
stopping early once the eigenvalue estimate has converged.
The core computation is a matrix-vector product.
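For reference, each iteration of the power method applies the matrix and renormalizes; the norm of the product converges to the magnitude of the dominant eigenvalue:

```latex
b_{k+1} = \frac{A b_k}{\lVert A b_k \rVert},
\qquad
\lVert A b_k \rVert \to |\lambda_{\max}| \quad \text{as } k \to \infty .
```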
The same code can be used for the CPU-based and
the GPU-based calculations.
static double DoPower(Matrix<double> A, Vector<double> b)
{
    double λ = 0;
    Vector<double> temp = null;
    int maxIterations = 1000;
    for (int i = 0; i < maxIterations; i++)
    {
        // Compute temp = A * b, reusing temp on every iteration:
        temp = Matrix<double>.MultiplyInto(A, b, temp);
        var λ1 = temp.Norm();
        // Normalize: b = temp / λ1
        Vector.MultiplyInto(1 / λ1, temp, b);
        if (Math.Abs(λ1 - λ) < 1e-5) break;
        λ = λ1;
    }
    return λ;
}

public static void Run()
{
    int size = 2000;
    var A = Matrix.CreateRandom(size, size);
    var b = Vector.CreateRandom(size);

    // CPU-based calculation:
    var l = DoPower(A, b);

    // GPU-based calculation on the same data:
    var dA = A.MakeDistributed();
    var db = b.MakeDistributed();
    l = DoPower(dA, db);
}
Shared Function DoPower(A As Matrix(Of Double), b As Vector(Of Double)) As Double
    Dim λ As Double = 0
    Dim imax = 1000
    Dim temp As Vector(Of Double) = Nothing
    For i As Integer = 0 To imax - 1
        ' Compute temp = A * b, reusing temp on every iteration:
        temp = Matrix(Of Double).MultiplyInto(A, b, temp)
        Dim λ1 As Double = temp.Norm()
        ' Normalize: b = temp / λ1
        Vector.MultiplyInto(1 / λ1, temp, b)
        If (Math.Abs(λ1 - λ) < 0.00001) Then Exit For
        λ = λ1
    Next
    Return λ
End Function

Public Shared Sub Run()
    Dim size = 2000
    Dim A = Matrix.CreateRandom(size, size)
    Dim b = Vector.CreateRandom(size)

    ' CPU-based calculation:
    Dim l = DoPower(A, b)

    ' GPU-based calculation on the same data:
    Dim dA = A.MakeDistributed()
    Dim db = b.MakeDistributed()
    l = DoPower(dA, db)
End Sub
let DoPower (A : Matrix<float>) (b : Vector<float>) =
    let imax = 1000
    b.SetValue(1.0) |> ignore
    let rec iterate λ0 x0 i =
        match i with
        | _ when i < imax ->
            // Compute x = A * b, reusing the result vector when available:
            let x =
                match x0 with
                | None -> Matrix.Multiply(A, b)
                | Some result -> Matrix.MultiplyInto(A, b, result)
            let λ = x.Norm()
            // Normalize: b = x / λ
            Vector.MultiplyInto(1.0 / λ, x, b) |> ignore
            if Math.Abs(λ - λ0) < 1e-5 then λ
            else iterate λ (Some x) (i + 1)
        | _ -> λ0
    iterate 0.0 None 0

let run () =
    let size = 2000
    let A = Matrix.CreateRandom(size, size)
    let b = Vector.CreateRandom(size)

    // CPU-based calculation:
    let l = DoPower A b

    // GPU-based calculation on the same data:
    let dA = A.MakeDistributed()
    let db = b.MakeDistributed()
    let l = DoPower dA db
    ignore l
The only difference between the CPU and GPU versions is that
MakeDistributed was called on the input arguments.