By default, all calculations done by the Extreme Optimization Numerical Libraries for .NET are performed by the CPU. In this section, we describe how calculations can be offloaded to a GPU or a compute cluster. Currently, only CUDA GPUs are supported.
In what follows, the term distributed refers to an object or action on a device, cluster, or node other than the main CPU and its associated memory. Local refers to an object or action on the CPU and its associated memory. All types that implement the distributed computing framework live in the Extreme.Mathematics.Distributed namespace.
The core object in the distributed computing framework is the distributed provider, which supplies core functionality for a specific distributed computing platform, such as memory and device management. Each distributed computing platform has its own provider: an object that inherits from DistributedProvider. The class has a static Current property that should be set to an instance of the provider.
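For example, a CUDA provider might be selected as follows. (The CudaProvider type name and the way its instance is constructed are placeholders here; the actual provider class is supplied by the CUDA-specific package.)
using Extreme.Mathematics.Distributed;

// Make the CUDA platform the target of all subsequent distributed
// calculations. CudaProvider is a placeholder for the platform-specific
// provider type that inherits from DistributedProvider.
DistributedProvider.Current = new CudaProvider();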
Distributed arrays are vectors and matrices that can be used in distributed computations. There are two types: DistributedVector&lt;T&gt; and DistributedMatrix&lt;T&gt;. They act just like normal vectors and matrices, except that calculations involving these arrays are performed on a device or cluster.
Distributed arrays keep their data in device memory. They may also have a local copy in CPU memory. Data is transferred between local and distributed memory only as necessary. The elements of a distributed array can still be accessed from CPU-based code. However, this may be expensive if the data has changed: in that case, the entire array must be transferred between distributed and local memory.
Distributed memory is an unmanaged resource. As such, the Dispose method should be called on distributed arrays to make sure that distributed memory is properly released.
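Because distributed arrays hold device memory, it is good practice to create them in a using block, or to call Dispose explicitly, as in this sketch (assuming a distributed provider has already been set up):
var x = Vector.CreateRandom(1000);

// The using block guarantees that the device memory held by the
// distributed vector is released, even if an exception is thrown.
using (var dx = x.MakeDistributed())
{
    // ... perform distributed calculations with dx ...
}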
Creating distributed arrays
Distributed arrays can be created in three ways: a normal array that lives in CPU memory can be converted to a distributed array; a new distributed array can be created directly on the device without creating a local copy; or a distributed array can be obtained as the result of a calculation that involves distributed arrays, in which case it has no local copy.
To convert a local array to a distributed array, use the provider's MakeDistributed method. Alternatively, the MakeDistributed extension method can be called directly on the array. The extension method uses the current distributed provider, so it is unambiguous as long as only one distributed provider is in use. The data isn't copied to the device immediately; this happens when the data is needed, or when you explicitly call the Distribute method.
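For example, the following sketch converts a local vector and then forces the transfer to device memory (here Distribute is assumed to be called on the distributed array itself):
var a = Vector.CreateRandom(10000);

// Wrap the local vector; no device memory is touched yet.
var da = a.MakeDistributed();

// Copy the data to device memory now, rather than waiting for the
// first distributed calculation that needs it.
da.Distribute();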
To create a distributed array without creating a local copy, use the provider's CreateVector&lt;T&gt; or CreateMatrix&lt;T&gt; method. The parameters are the desired length of the vector, or the desired number of rows and columns of the matrix, respectively. Distributed memory is allocated immediately, and the method throws an exception if the allocation fails.
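For example, assuming the current provider is available through DistributedProvider.Current:
var provider = DistributedProvider.Current;

// Allocate a vector of length 1000 and a 500-by-500 matrix directly
// in device memory; neither has a local copy.
using (var dv = provider.CreateVector<double>(1000))
using (var dm = provider.CreateMatrix<double>(500, 500))
{
    // ... fill and use the distributed arrays here ...
}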
You can access parts of a distributed array by calling methods like GetSlice or GetRow. No data is transferred during this operation. If the array does not have a local copy, then the sub-array will not have a local copy, either. Note, however, that when data is copied, the entire original array is copied, not just the sub-array.
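For example (the GetRow overload shown here is illustrative):
var A = Matrix.CreateRandom(1000, 1000);

using (var dA = A.MakeDistributed())
{
    // A view of the third row. No data moves between local and
    // distributed memory; the view shares dA's storage and has a
    // local copy only if dA has one.
    var row = dA.GetRow(2);
}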
Operations on distributed arrays
Most matrix and vector operations can be performed directly on distributed arrays. The exact details depend on the provider. If an operation cannot be performed in a distributed way, any distributed data is copied locally, and the calculation is done locally instead. For binary operations, whenever one of the operands is a distributed array, the entire calculation is attempted on distributed arrays.
In general, it is a good idea to specify the result array in the expression. This can be done by using the Into version of an operation, for example AddInto. Doing so can greatly reduce the number of temporary arrays created during a calculation, and also helps ensure that distributed memory is properly released. Make sure all temporary arrays are disposed when they are no longer needed.
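For example, a sketch that writes the result of an addition into an existing array (the exact AddInto signature may differ from what is shown here):
int n = 10000;
var provider = DistributedProvider.Current;

using (var x = provider.CreateVector<double>(n))
using (var y = provider.CreateVector<double>(n))
using (var sum = provider.CreateVector<double>(n))
{
    // Write the result of the addition into 'sum' instead of
    // allocating a new distributed vector for it.
    Vector.AddInto(x, y, sum);
}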
Example: The Power Method
Here is an example that shows all the essential elements. It runs the power method for computing the largest eigenvalue of a matrix using CUDA, for up to 1000 iterations or until the eigenvalue estimate converges. The core computation is a matrix-vector product. The same code can be used for the CPU-based and the GPU-based calculations.
static double DoPower(Matrix<double> A, Vector<double> b)
{
    double λ = 0;
    // Start from a vector of all ones.
    b.SetValue(1.0);
    Vector<double> temp = null;
    int maxIterations = 1000;
    for (int i = 0; i < maxIterations; i++)
    {
        // Core computation: matrix-vector product. Reusing 'temp'
        // avoids allocating a new (distributed) vector on every iteration.
        temp = Matrix<double>.MultiplyInto(A, b, temp);
        var λ1 = temp.Norm();
        // Normalize to get the next iterate.
        Vector.MultiplyInto(1 / λ1, temp, b);
        // Stop when the eigenvalue estimate has converged.
        if (Math.Abs(λ1 - λ) < 1e-5) break;
        λ = λ1;
    }
    temp.Dispose();
    return λ;
}
public static void Run()
{
    int size = 2000;
    var A = Matrix.CreateRandom(size, size);
    var b = Vector.CreateRandom(size);

    // CPU-based calculation:
    var l = DoPower(A, b);

    // GPU-based calculation: wrap the same data in distributed arrays.
    var dA = A.MakeDistributed();
    var db = b.MakeDistributed();
    l = DoPower(dA, db);

    // Release the device memory held by the distributed arrays.
    dA.Dispose();
    db.Dispose();
}
Shared Function DoPower(A As Matrix(Of Double), b As Vector(Of Double)) As Double
    Dim λ As Double = 0
    ' Start from a vector of all ones.
    b.SetValue(1.0)
    Dim imax = 1000
    Dim temp As Vector(Of Double) = Nothing
    For i As Integer = 0 To imax - 1
        ' Core computation: matrix-vector product, reusing 'temp'.
        temp = Matrix(Of Double).MultiplyInto(A, b, temp)
        Dim λ1 As Double = temp.Norm()
        ' Normalize to get the next iterate.
        Vector.MultiplyInto(1 / λ1, temp, b)
        ' Stop when the eigenvalue estimate has converged.
        If (Math.Abs(λ1 - λ) < 0.00001) Then Exit For
        λ = λ1
    Next
    temp.Dispose()
    Return λ
End Function
Public Shared Sub Run()
    Dim size = 2000
    Dim A = Matrix.CreateRandom(size, size)
    Dim b = Vector.CreateRandom(size)

    ' CPU-based calculation:
    Dim l = DoPower(A, b)

    ' GPU-based calculation: wrap the same data in distributed arrays.
    Dim dA = A.MakeDistributed()
    Dim db = b.MakeDistributed()
    l = DoPower(dA, db)

    ' Release the device memory held by the distributed arrays.
    dA.Dispose()
    db.Dispose()
End Sub
let DoPower (A : Matrix<float>) (b : Vector<float>) =
    let imax = 1000
    // Start from a vector of all ones.
    b.SetValue(1.0) |> ignore
    let rec iterate λ0 x0 i =
        match i with
        | _ when i < imax ->
            // Core computation: matrix-vector product. After the first
            // iteration, the result vector is reused.
            let x =
                match x0 with
                | None -> Matrix.Multiply(A, b)
                | Some result -> Matrix.MultiplyInto(A, b, result)
            let λ = x.Norm()
            // Normalize to get the next iterate.
            Vector.MultiplyInto(1.0 / λ, x, b) |> ignore
            // Stop when the eigenvalue estimate has converged.
            if (Math.Abs(λ - λ0) < 1e-5) then
                x.Dispose()
                λ
            else
                iterate λ (Some x) (i+1)
        | _ -> λ0
    iterate 0.0 None 0

let run =
    let size = 2000
    let A = Matrix.CreateRandom(size, size)
    let b = Vector.CreateRandom(size)
    // CPU-based calculation:
    let l = DoPower A b
    // GPU-based calculation: wrap the same data in distributed arrays.
    let dA = A.MakeDistributed()
    let db = b.MakeDistributed()
    let l = DoPower dA db
    // Release the device memory held by the distributed arrays.
    dA.Dispose()
    db.Dispose()
The only difference between the CPU and GPU versions is that MakeDistributed was called on the input arguments.