NVIDIA's CUDA is one of the most widely used
GPU computing platforms.
The Extreme Optimization Numerical Libraries for .NET
let you take advantage of CUDA-enabled graphics cards and devices
through the distributed computing framework.
CUDA support is enabled through the
CudaProvider class.
This section only discusses issues specific to CUDA. For general information
on the distributed computing framework, see the previous section on
distributed and GPU computing.
Version 4.0 or higher of the .NET Framework is required in order
to use the CUDA functionality. You also need to have the NVIDIA CUDA Toolkit
v5.5 (for 32-bit) or v7.5 (for 64-bit) installed on your machine.
This toolkit can be downloaded from
NVIDIA's website.
To run the software, you need a CUDA-enabled graphics card
with compute capability 1.3 or higher.
Creating CUDA-enabled applications
The first step in adding CUDA support to your application is
to add a reference to the CUDA provider assembly for your platform:
Extreme.Numerics.Cuda.Net40.x86.dll or
Extreme.Numerics.Cuda.Net40.x64.dll.
Next, you need to inform the distributed computing framework that
you are using the CUDA provider:
C#:
DistributedProvider.Current =
    Extreme.Mathematics.Distributed.CudaProvider.Default;
Visual Basic:
DistributedProvider.Current =
    Extreme.Mathematics.Distributed.CudaProvider.Default
F#:
DistributedProvider.Current <-
    Extreme.Mathematics.Distributed.CudaProvider.Default
Finally, you need to adapt your code to use distributed arrays where appropriate.
The guidelines for working with distributed arrays from the previous section
apply to CUDA code as well.
CUDA-specific functionality
The CUDA provider exposes a number of functions specific
to the CUDA environment:
Method | Description
---|---
GetAvailableMemory | Returns the free memory available on the device, in bytes. Note that because of memory fragmentation, it is unlikely that a block of this size can be allocated.
GetTotalMemory | Returns the total memory on the device, in bytes.
GetDeviceLimit(Int32) | Wrapper for the cudaDeviceGetLimit function.
The GetAvailableMemory
method is particularly useful for verifying that all device memory has been properly released.
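A minimal sketch of that verification pattern follows. The delegate stands in for the provider's GetAvailableMemory method, so this illustrates the bookkeeping rather than the library's exact API:

```csharp
using System;

// Hypothetical leak check: compare free device memory before and after a
// workload. The getAvailableMemory delegate stands in for the CUDA
// provider's GetAvailableMemory method.
static bool DeviceMemoryLeaked(Func<long> getAvailableMemory, Action workload)
{
    long before = getAvailableMemory();
    workload();
    // If less memory is free afterwards, some device memory was not released.
    return getAvailableMemory() < before;
}

// Simulated device: a workload that "forgets" to free 1 MB.
long freeMemory = 512L * 1024 * 1024;
bool leaked = DeviceMemoryLeaked(() => freeMemory, () => freeMemory -= 1024 * 1024);
Console.WriteLine(leaked); // True
```

In practice you would call GetAvailableMemory once after initialization and once after all distributed arrays have been disposed, and compare the two values.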
Interoperating with other CUDA libraries
The CUDA provider supplies a large number of functions that are optimized
for use on CUDA GPUs. Sometimes it is necessary to call into external libraries.
This section outlines how to do this.
A pointer to device memory can be obtained from a distributed array through the
NativeStorage
property, which is available for both vectors and matrices.
This property returns a storage structure that has two relevant fields.
For vectors, the Values
field is an IntPtr
that points to the start of the memory block that contains the data for the vector.
The Offset
field is the number of elements (not bytes)
from the start of the memory block where the first element in the vector is stored.
This information can be combined to get the starting address of the vector's elements.
Storage for vectors may not be contiguous. This can happen, for example,
when the vector represents a row in a matrix.
The Stride
property specifies the number of elements between vector elements. A value of 1
corresponds to contiguous storage.
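Putting Values, Offset, and Stride together, the device address of element i of a vector of doubles can be computed as below. This is a sketch of the arithmetic just described, not a call into the library; the field values would come from the vector's NativeStorage property.

```csharp
using System;

// Device address of element i of a vector of doubles, given the storage
// fields described above (Values, Offset, Stride). Illustrative arithmetic;
// the actual values come from the vector's NativeStorage property.
static IntPtr VectorElementAddress(IntPtr values, long offset, long stride, long i)
{
    long elementIndex = offset + i * stride;  // index measured in elements
    return new IntPtr(values.ToInt64() + elementIndex * sizeof(double));
}

// Contiguous storage (stride = 1), data starting 2 elements into the block:
// element 3 lies (2 + 3) * 8 = 40 bytes past the start of the block.
var address = VectorElementAddress(new IntPtr(0x1000), 2, 1, 3);
Console.WriteLine(address.ToInt64() - 0x1000); // 40
```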
For matrices, the Values
field is an IntPtr
that points to the start of the memory block that contains the data for the matrix.
The Offset
field is the number of elements (not bytes)
from the start of the memory block where the first element in the matrix is stored.
This information can again be combined to get the starting address of the matrix's elements.
Matrices are stored in column-major order. This means that columns are stored contiguously.
It is possible that not all elements in a matrix are contiguous.
The LeadingDimension
property specifies the number of elements between the start of each column. This is usually equal
to the number of rows in the matrix, but not always.
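In the same spirit, the device address of element (i, j) of a column-major matrix of doubles follows from Values, Offset, and LeadingDimension. Again, a sketch of the arithmetic rather than the library's API:

```csharp
using System;

// Device address of element (i, j) of a column-major matrix of doubles,
// given the storage fields described above (Values, Offset, LeadingDimension).
// Illustrative arithmetic; the actual values come from NativeStorage.
static IntPtr MatrixElementAddress(IntPtr values, long offset,
    long leadingDimension, long i, long j)
{
    // Columns are contiguous; moving one column over skips leadingDimension elements.
    long elementIndex = offset + i + j * leadingDimension;
    return new IntPtr(values.ToInt64() + elementIndex * sizeof(double));
}

// A 3x4 view stored with leadingDimension = 5 (e.g. the top rows of a 5x4 block):
// element (2, 1) is 2 + 1 * 5 = 7 elements, or 56 bytes, past the first element.
var address = MatrixElementAddress(new IntPtr(0x2000), 0, 5, 2, 1);
Console.WriteLine(address.ToInt64() - 0x2000); // 56
```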
Once the device addresses of the data have been obtained, they can be passed to
an external function. If this function modifies the values of an array,
this should be signaled by invalidating the array's local data. Otherwise,
an outdated local copy of the data may be used when retrieving the results.
This can be done with a call to
Invalidate(DistributedDataLocation).
The CUDA provider has an overloaded
Copy
method that can copy from device to host, host to device, and device to device.