Summary of CUDA Support

Using the X10/CUDA backend, one can identify fragments of an X10 program to run on the GPU.  For ideal workloads, this can give a speedup of up to 30x or more.  Any X10RT backend can be used with CUDA, but X10 must be built specially from source to allow this capability.

The idea behind the X10/CUDA implementation is to expose the low level CUDA fundamentals in as direct a fashion as possible, to maximize the ways in which the backend can be used, and to present as few surprises as possible to programmers.  To support this, we have also striven to change X10 minimally, and 99% of the semantics of CUDA are represented with certain design patterns of X10 language features.  For this reason, one needs to have a reasonably good understanding of the distributed/multicore features of the X10 language (places, async, finish, and the distributed object model, which are together commonly known as the APGAS model) before programming CUDA kernels in X10.

For a detailed technical description of how X10 is compiled to CUDA to run on GPUs, please see

Prerequisites

You will need a CUDA-capable NVidia GPU.  All recent NVidia GPUs support CUDA; anything from the '8000' series onwards should work.  Check the documentation for your hardware to be sure.  Lists of supported chipsets can be found on the NVidia website.  If the specifications for your card include a number of "CUDA cores", then you can be sure it supports CUDA.

You will need to install graphics drivers that support CUDA.  We have tested with the latest versions of the drivers, so you should also use these if possible.  Download CUDA drivers from the NVidia website https://developer.nvidia.com/cuda-toolkit-archive

You will also need the NVidia CUDA compiler, 'nvcc', and its runtime libraries.  These are part of the 'toolkit' download from the same site.  Make sure nvcc is included in your path and that the toolkit is installed in /usr/local/cuda.  If x10c++ cannot find nvcc, it will print a warning message and produce an executable that will only run on the CPU.  If you then try to run that executable on your GPU, you will get a message about not being able to find the cubin files, since these are produced by nvcc.
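
To verify this before compiling anything, you can check that the compiler is reachable from your shell (the version reported will depend on your toolkit installation):

which nvcc
nvcc --version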

Compiling

Unpack the source release tarball into a directory of your choice.  Invoke ant as follows to build the whole X10 compiler and runtime:

cd x10.dist
ant dist -DX10RT_CUDA=true -Doptimize=true

The build process will expect to find the CUDA toolkit headers and libraries installed on the local system in /usr/local/cuda.

If you want to use a combination of CUDA and MPI (e.g. for an InfiniBand cluster of hosts with GPUs attached), include the following extra argument:

-DX10RT_MPI=true

If you want to use gdb or valgrind to debug your generated programs, you may want to build the X10 runtime without optimization, with GC disabled (it interferes with valgrind), and with extra assertions enabled.  To do so, unpack the tarball into a different directory and build it as follows (again, include the extra MPI argument if needed):

cd x10.dist
ant dist -DX10RT_CUDA=true -Doptimize=false -DDISABLE_GC=true

Note that your application will be built with optimization if and only if you give the -O argument to x10c++.  The -Doptimize argument above affects only the optimization of the X10 runtime.  It is possible, and sometimes useful, to have an unoptimized application linked against an optimized X10 runtime, and vice versa.

Testing The Build with The CUDA Samples

Compiling

In the x10.dist/samples/cuda directory there are 5 CUDA-capable X10 programs you can now try.  Change to that directory and compile each one as follows:

../../bin/x10c++ -O -STATIC_CHECKS CUDATopology.x10 -o CUDATopology
../../bin/x10c++ -O -STATIC_CHECKS CUDABlackScholes.x10 -o CUDABlackScholes
../../bin/x10c++ -O -STATIC_CHECKS KMeansCUDA.x10 -o KMeansCUDA
../../bin/x10c++ -O -STATIC_CHECKS CUDA3DFD.x10 -o CUDA3DFD
../../bin/x10c++ -O -STATIC_CHECKS CUDAMatMul.x10 -o CUDAMatMul

If you want to use MPI, you should add the following argument to x10c++ in each case:

-x10rt mpi

The -O parameter is optional here, but will give you the best performance for the parts of the code that run on the host. The -STATIC_CHECKS parameter is also optional, but recommended -- it ensures the X10 compiler will not inject dynamic constraint checks into the kernel, which would otherwise result in nvcc compilation errors. You may optionally give x10c++ the -NO_CHECKS argument to suppress generation of array bounds, null pointer, and divide-by-zero checks for the host code; for the version of the kernel that runs on the GPU, these checks are always disabled, even without -NO_CHECKS.
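
For example, a build of CUDABlackScholes with host-side checks suppressed as well would look like this (purely illustrative; the flags are described above):

../../bin/x10c++ -O -NO_CHECKS -STATIC_CHECKS CUDABlackScholes.x10 -o CUDABlackScholes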

These builds will try to execute nvcc via your $PATH, so please ensure it can be found there.

Executing

All of the compiled samples above can be run just like normal X10 programs.  If you compiled with -x10rt mpi, launch them with mpirun in the ordinary way for MPI programs.

However, the applications will not use GPUs by default.  To make them use your local GPUs, define the environment variable X10RT_ACCELS, like this:

X10RT_ACCELS=ALL ./CUDATopology
X10RT_ACCELS=ALL ./CUDABlackScholes
X10RT_ACCELS=ALL ./KMeansCUDA -i 50
X10RT_ACCELS=ALL ./CUDA3DFD
X10RT_ACCELS=ALL ./CUDAMatMul

This allows all GPUs to be used by each X10 host place.  In the above example, only one host place exists because we did not specify X10_NPLACES.  If you run with mpirun, each MPI process inspects X10RT_ACCELS to choose how many GPUs to use; consult your mpirun documentation to see how to pass environment variables to the MPI processes.
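
For example, with Open MPI's mpirun (other MPI implementations use different flags), the variable can be forwarded to every process like this:

mpirun -np 4 -x X10RT_ACCELS=ALL ./CUDABlackScholes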

If you run these samples without the environment variable set, that is equivalent to setting it to 'NONE', and no GPUs will be used.  You should see that the CUDABlackScholes and KMeansCUDA samples run a lot slower in this case, because they are written to fall back to the host CPU for the computation if they do not find any GPUs.

  • The CUDATopology sample just prints the GPU places available underneath every host place, so its output will change depending on the value of X10RT_ACCELS.
  • CUDABlackScholes is a reimplementation of the NVidia BlackScholes CUDA demo.  It runs with similar performance to the NVidia version.
  • KMeansCUDA is an implementation of KMeansSPMD.x10, in the samples directory, that uses the GPU for acceleration.  The -v parameter will make it print out the clusters at each iteration so you can watch them converge.  It only works correctly when the number of clusters is a multiple of 20.
  • CUDA3DFD is a port of the 3DFD benchmark in the NVidia CUDA SDK.  On the GPU it calculates the finite differences of a 3d volume and verifies the result against a CPU computation of the same.
  • CUDAMatMul is a port of one of Volkov's dense matrix multiplication kernels.

You can also specify a comma separated list of GPUs to use, and can oversubscribe each GPU several times.  For instance, the following value for X10RT_ACCELS will use the first GPU 3 times and the third GPU twice:

X10RT_ACCELS=CUDA0,CUDA0,CUDA0,CUDA2,CUDA2

If the X10 package was compiled without support for CUDA, then no GPUs will ever be discovered at runtime, and X10RT_ACCELS will be silently ignored.

Writing Your Own CUDA Kernels

The X10/CUDA design utilizes existing X10 language features to express the various concepts defined by the CUDA system.  The code that you write is therefore much like normal X10 code, but certain patterns are enforced because GPUs are not as capable as ordinary CPUs.

CUDA Places

A single GPU has its own memory, distinct from the host memory and from the memory of other GPUs, so we represent each GPU with a place.  In fact, one can oversubscribe GPUs using the X10RT_ACCELS environment variable; in this case several places will be created on the same GPU, but they remain distinct in terms of X10 semantics.

The Place API contains methods that can be used for finding out what GPUs are available, which in turn depends on the value of X10RT_ACCELS when the program was executed.  In particular, isCUDA() will identify whether or not a particular place is a GPU, and parent() will return the host of a GPU.  For a given place p, one can iterate over PlaceTopology.getTopology().getChildren(p) in order to get the GPUs at that place.  Also, PlaceTopology.getTopology().numChildren(p) will return the number of children.  Any GPU program ought to have a strategy for when there are no GPUs present, perhaps choosing to run on the CPU instead.
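
As a minimal sketch using only the API calls described above (not taken from the samples), the following picks the first GPU child of the current host place and falls back to the host itself if none were configured via X10RT_ACCELS:

var gpu:Place = here;                               // fall back to running on the host CPU
for (p in PlaceTopology.getTopology().getChildren(here)) {
    if (p.isCUDA()) {
        gpu = p;                                    // use the first CUDA place under this host
        break;
    }
}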

While the API contains calls that suggest some places may be Cell SPEs, there is currently no support for compiling X10 programs to run on the Cell.  However, since it is likely that other kinds of accelerators will be supported in the future (e.g. OpenCL), we do not guarantee that every child of a place is a CUDA GPU.

Memory On CUDA Places

Just as during the execution of any X10 program, each place has its own heap, and objects live in a particular place.  Unlike in normal X10 programs, CUDA requires that GPU memory be allocated and managed by the host place, rather than by the GPU's own place.  We therefore provide some utility functions to represent this kind of remote memory management.  Each of these utility functions also works if the remote place is a CPU instead of a GPU, since under the covers they just use normal X10 language features to implement the needed semantics.

Allocating GPU Memory

The main datatype for doing work on GPUs is Rail.  However, it cannot be allocated in the usual way, because allocation inside CUDA kernels is not allowed:

val remote_gpu_array = at (gpu) new Rail[T](sz, init); // will not work

Instead, we provide the following function in the class x10.util.CUDAUtilities, to be called from the host, whose semantics are the same as the above:

val remote_gpu_array = CUDAUtilities.makeGlobalRail[Float](gpu, sz, init);

As with ordinary array construction, init can be a closure, a single value to initialise all elements, or an existing array of the same kind.
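
For example, the following sketches show the three initializer forms, assuming gpu is a GPU place, sz is the desired length, and host_array is an existing Rail[Float] on the host:

val a = CUDAUtilities.makeGlobalRail[Float](gpu, sz, (i:Long) => (i as Float)); // closure initializer
val b = CUDAUtilities.makeGlobalRail[Float](gpu, sz, 0.0f);                     // every element set to 0.0f
val c = CUDAUtilities.makeGlobalRail[Float](gpu, host_array.size, host_array);  // copy of an existing array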

To free the array, since there is no garbage collection for CUDA objects yet, call the following:

CUDAUtilities.deleteGlobalRail(remote_gpu_array);

Copying Values Between GPU and Host Places

The usual X10 mechanisms for copying arrays can be used to copy to/from arrays in GPU memory. Note that the following call is asynchronous, so it should be wrapped in a finish statement to ensure it has completed.

Rail.asyncCopy(src_array, src_offset, dst_array, dst_offset, len);
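
For example, a typical round trip looks like the following sketch, where host_array is a local Rail[Float] and gpu_array was created with CUDAUtilities.makeGlobalRail (the names are illustrative):

// copy host -> GPU and wait for the DMA to complete
finish Rail.asyncCopy(host_array, 0, gpu_array, 0, host_array.size);

// ... launch a kernel at the GPU place here ...

// copy the results back from the GPU and wait for completion
finish Rail.asyncCopy(gpu_array, 0, host_array, 0, host_array.size);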

Since X10 is a garbage-collected language, we cannot allocate all our host arrays using cudaMallocHost.  This means the host arrays are not pinned, so we are forced to spool the data through an intermediate pinned buffer (hidden in the X10 runtime).  Because of this spooling, DMA bandwidth takes up to a 30% hit.  This is a known limitation of the design of CUDA, and has nothing to do with X10.

The size of the spooling buffer can be specified in bytes with the environment variable X10RT_CUDA_DMA_SLICE, and defaults to 1MB.  By increasing this, one can often reduce the overhead of the spooling.
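
For example, to run CUDABlackScholes with a 4MB spooling buffer (value purely illustrative):

X10RT_CUDA_DMA_SLICE=4194304 X10RT_ACCELS=ALL ./CUDABlackScholes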

Kernels

To run X10 code on the GPU, a particular pattern of X10 language constructs must be used.  Since the GPU is a remote place, we have to use the 'at' construct to run code there.  And since the underlying CUDA kernels are asynchronous, we currently require the form to be:

async at (gpu) {  ...  }

In future we will also allow "at (gpu) { ... }" but at present it is necessary to use "finish async at (gpu) { ... }" to synchronize on the completion of a kernel.

Since only certain kinds of code can be executed on the GPU, there is an annotation @CUDA, found in the package x10.compiler, that must be used to inform the compiler that a certain block of code is intended to run on the GPU.  Since places are runtime quantities, it is not possible to statically know that the value of the 'gpu' variable is actually a GPU place and not just another host.  So we use the @CUDA annotation to cause the block of code to be compiled for both CPU and GPU, instead of just for the CPU like the rest of the X10 program.  In fact this behaviour is useful, as it allows CUDA code to run on the host if no GPUs can be found, through the usual X10 generated code.

async at (gpu) @CUDA {  ...  }

Once you add the @CUDA annotation, the compiler will insist that certain restrictions are obeyed within the annotated block.  It will also trigger the compiler to attempt to run 'nvcc' to generate a cubin file for each class that contains an occurrence of @CUDA.  If you attempt to run code on a GPU place, and that code has not been marked with @CUDA, you will get a runtime error of the form "X10RT: async id X is not a CUDA kernel".
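
Putting these pieces together, a launch that blocks until the kernel has completed looks like the following sketch (the kernel body, which must obey the restrictions described below, is elided):

finish async at (gpu) @CUDA {
    ...
}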

Kernel Structure

The basic execution pattern of a CUDA kernel is that there are a number of 'blocks', each of which spawns a number of 'threads', and each thread executes the same code.  We represent this using X10 language constructs as follows:

async at (gpu) @CUDA {
    finish for (block in 0n..239n) async {
        clocked finish for (thread in 0n..63n) clocked async {
            ...
        }
    }
}

Note the 'async' in the inner for loop.  The '...' is what one would typically call 'the CUDA kernel', as it is the code that each thread runs in parallel.  The outer async specifies that CUDA blocks also run in parallel.  Naturally, the code denoted by '...' may use the loop variables 'block' and 'thread' (which can be named arbitrarily).  The 'clocked' qualifiers on the inner finish and async allow the use of the Clock.advanceAll() call within the kernel; this is compiled down to the __syncthreads() intrinsic.  In future we will allow the omission of the 'clocked' qualifiers, in which case the use of Clock.advanceAll() will not be permitted.  Other uses of clocks within the kernel are not permitted.  We also allow the total number of threads and the number of blocks to be specified using variables captured from the enclosing scope, as follows:

async at (gpu) @CUDA {
    val blocks = 240n;
    val threads = 64n;
    finish for (block in 0n..(blocks-1n)) async {
        clocked finish for (thread in 0n..(threads-1n)) clocked async {
            ...
        }
    }
}

As in normal X10 programs, the code in '...' can access variables from the enclosing scope, i.e. the host.  Such captured variables will be automatically copied to the GPU.  Entire arrays will be copied, which can be slow, so if an array is large and does not change from one kernel invocation to another, it may be better to create a remote array up front using CUDAUtilities.makeGlobalRail(...).  In this case, the kernel captures the reference to this specially created remote array instead of capturing a local array on the host.  One must consider pre-allocating remote arrays when optimizing the performance of regular X10 programs too, as the copying behaviour of captured arrays is the same for GPUs and CPUs.  A sketch of this pattern is shown below.
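
Here is a hedged sketch of that pattern (all names are illustrative): the remote Rail is allocated once on the GPU place, captured by reference inside the kernel, and freed explicitly afterwards:

val N = 240 * 64;
val gpu_data = CUDAUtilities.makeGlobalRail[Float](gpu, N, 0.0f); // allocated once in GPU memory

finish async at (gpu) @CUDA {
    finish for (block in 0n..239n) async {
        clocked finish for (thread in 0n..63n) clocked async {
            val tid = block * 64n + thread;
            gpu_data(tid) = tid as Float;   // writes go directly to the pre-allocated GPU array
        }
    }
}

CUDAUtilities.deleteGlobalRail(gpu_data);   // no garbage collection for CUDA objects yet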

Stack variables inside the '...' are compiled to use CUDA registers.

Limitations

There are currently many constructs that are not supported in GPU code:

  • structs
  • method calls (except in special cases, e.g. Rail.apply and primitive arithmetic)
  • new
  • creating / calling closures
  • up/down casts or instanceof tests
  • throwing exceptions
  • catching exceptions
  • spawning more asyncs or using finish
  • await blocks
  • atomic blocks

Some of these we plan to address in future versions (e.g. method calls, structs).  However there are also fundamental limitations of CUDA hardware and runtimes that make it difficult for us to implement e.g. new, async, or dynamic dispatch for closures and method calls.  In the latest CUDA GPUs, some of these issues have been addressed so it is hard to say with certainty what we will and will not allow in future.  It is also unclear how many of these features will be useful for writing kernels, and what ultimately comprises an appropriate and well-rounded subset of language features.

Kernel Parameters

Variables defined outside of the kernel and used within it (captured variables) are the X10/CUDA equivalent of CUDA kernel parameters.  We use two strategies for implementing them.  The default strategy is to create a struct containing all the captured variables and DMA it to the GPU before executing the kernel.  This always works, but is less efficient than the other technique, which is to use CUDA kernel parameters directly to hold the captured environment.  In the latter case, if the environment is too large, it will not fit in the small area designated by CUDA for transferring parameters to the kernel.

The X10 programmer can currently control which mechanism is used, although in future we will automatically decide based on the size of the environment.  To use CUDA kernel parameters instead of a separate DMA, add the annotation @CUDADirectParams after the @CUDA annotation.  This annotation is also found in the package x10.compiler.
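
For example (a sketch of the same kernel shape used earlier), switching to direct parameter passing only requires the extra annotation:

async at (gpu) @CUDA @CUDADirectParams {
    finish for (block in 0n..239n) async {
        clocked finish for (thread in 0n..63n) clocked async {
            ...
        }
    }
}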

Shared Memory and Constant Memory

Shared memory is a CUDA concept for memory that threads within a block share and can read/write.  This is distinct from registers, which are local to each individual thread.  In KMeansCUDA.x10, shared memory is used as a cache, since global memory fetches are slow in CUDA.  In CUDA3DFD.x10 and CUDAMatMul.x10 it is used as a cache that is updated each iteration, effectively implementing a staging ground through which to stream data from the much slower global memory.

Constant memory is a CUDA concept for memory that is populated before a kernel runs and is immutable for the duration of that kernel.  Every block and thread reads from the same data.  The hardware uses a cache for the constant memory that in many cases makes it as fast to access as local registers or shared memory.

In X10, we express these semantics by defining heap objects in specific places:

async at (gpu) @CUDA {
    val cmem = CUDAConstantRail(external_array);
    finish for (block in 0n..63n) async {
        val shmem = new Rail[Float](240, init);
        clocked finish for (thread in 0n..239n) clocked async {
            ...
            shmem(thread) = thread;
            ...
            Clock.advanceAll();
            ...
            val tmp = shmem((thread+1) % 240);
            ...
        }
    }
}

Because 'cmem' is immutable, we represent it with an instance of the CUDAConstantRail class, which is constructed from an existing array on the host.  A CUDAConstantRail object allows random access to the array, but not updates.

Since 'shmem' is mutable, we use a Rail, which allows both random access and updates.  However, updates to shmem need to be synchronized between threads using the Clock.advanceAll() call, just as one would use __syncthreads() to synchronize shared memory in a native CUDA program.

In both cases, since the object is not actually allocated on the heap, but in shared / constant memory that does not outlive the CUDA kernel, it is important not to leak the 'shmem' or 'cmem' reference by writing it to memory that will be accessed in a later kernel.

Limitations

  • Only Rails of Int / Float may be specified in shared memory
  • Only CUDAConstantRails of Int / Float may be specified in constant memory
  • The requirement that shared / constant memory objects not be leaked is not yet enforced.
  • The constant memory is repopulated on each kernel invocation.  We do not yet support the idiom of populating the constant memory once and then running many kernel invocations.

We plan to address these limitations before the next release.

Auto Blocks / Threads

The performance of CUDA kernels is very sensitive to the number of blocks/threads used, to the particulars of the kernel code (e.g. the number of registers / amount of shared memory used), and to the hardware that the code runs on.  This makes it very difficult to write portable code, since there is such variety in GPUs.  To help in this situation, we provide a feature that will automatically choose the numbers of blocks and threads, following a strategy that seems to work well in most cases.  The idea is to maximize utilization while using a multiple of 64 threads, and as few blocks/threads as possible.  Utilization is a CUDA concept meaning the number of hardware thread slots that are occupied.

To use this functionality, declare two variables in the following manner, within the block annotated with @CUDA, and use these variables to define the extents of the loops:

async at (gpu) @CUDA {
    val blocks = CUDAUtilities.autoBlocks();
    val threads = CUDAUtilities.autoThreads();
    finish for (block in 0n..(blocks-1n)) async {
        clocked finish for (thread in 0n..(threads-1n)) clocked async {
            ...
        }
    }
}

The kinds of kernel that can make use of this are kernels whose correctness is not sensitive to the number of blocks/threads, and whose shared memory requirements do not depend on the number of blocks or threads, since the amount of shared memory required is actually used to determine how many blocks/threads there will be.

The exact algorithm iterates down a list of (blocks, threads) pairs, with preferred pairs (those giving greater utilization on a given MP) at the beginning of the list.  The selected pair is the first one that is a valid way to instantiate the kernel in question on the GPU in question.  The selected pair is used to spawn the kernel, except that the number of blocks is scaled by the number of MPs (what NVidia calls 'streaming multiprocessors', but which would conventionally be called cores) in that GPU.  Pairs are stepped over if they require too many registers, too much shared memory, etc.  The list contains 32 pairs, and the curious can find its definition in x10.runtime/src-cpp/x10aux/network.cc.

Performance Notes

The current implementation offers considerable performance transparency.  This also means we have not attempted to hide any of the performance artefacts that are peculiar to CUDA devices and the CUDA programming model.  Programmers need to be aware of coalesced memory accesses and bank conflicts, and have to choose the number of blocks/threads wisely.  The NVidia CUDA profiler, enabled with the CUDA_PROFILE and CUDA_PROFILE_CONFIG environment variables, is very useful for debugging performance problems in X10/CUDA programs.  The X10/CUDA programmer is best equipped to deal with performance issues on the GPU after reading the relevant parts of the CUDA manual, where the performance model is discussed in detail.
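
For example, the command-line profiler can be enabled for a single run as follows; profiling output is written to a log file in the current directory, whose exact name depends on your CUDA version:

CUDA_PROFILE=1 X10RT_ACCELS=ALL ./KMeansCUDA -i 50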