This web page presents a quick summary of some key issues in performance tuning X10 applications with the goal of helping users understand and improve the performance of applications written in X10 2.2. Significantly more detail on this topic can be found in two papers presented at the X10'11 workshop at PLDI. We strongly recommend that you at least skim these papers if you are interested in performance tuning X10 applications.
- A Performance Model for X10 Applications presents a general discussion of the X10 performance model and details of the Native X10 implementation (X10 compiled to C++).
- Compiling X10 to Java contains an in-depth discussion of Managed X10 performance (X10 compiled to Java).
The rest of this page covers the following topics:
- Best Practices for Performance Tuning
- Compiler and build options to maximize performance
- Configuring the X10 Runtime
- Selecting the right X10RT Implementation
- Tips and Tricks for Performance Analysis of X10 programs
- Implementation Limitations and Other Pitfalls
Best Practices for Performance Tuning
Functionality first, Performance second
It's always advisable to get the code working and thoroughly tested first, and only then worry about making it perform.
Use the right compiler flags (and the right compiler).
Be sure to compile your program with -O. Compile with -O -NO_CHECKS for peak performance once you are sure that the code is correct. For most programs, you will get better performance with x10c++ than with x10c. The main exceptions are programs that are heavily object-oriented and/or have a very high allocation rate.
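The two-stage use of flags described above can be scripted. The following is a minimal sketch rather than part of the X10 distribution: the helper name `x10_flags`, the stage names `test` and `peak`, and the source file `MyApp.x10` are hypothetical. Only the `-O` and `-NO_CHECKS` flags come from this page. The script prints the command lines instead of invoking the compiler.

```shell
#!/bin/sh
# Sketch: choose X10 compiler flags by build stage.
# Assumptions (not from the X10 distribution): the function name x10_flags,
# the stage names "test" and "peak", and the source file MyApp.x10.

x10_flags() {
    case "$1" in
        peak) echo "-O -NO_CHECKS" ;;  # checks disabled: only once the code is known to be correct
        *)    echo "-O" ;;             # optimized, runtime checks still enabled
    esac
}

# Print the command line for each stage instead of running the compiler.
for stage in test peak; do
    echo "x10c++ $(x10_flags "$stage") MyApp.x10"
done
```

Keeping the checked build as the default stage means a mistyped stage name never silently disables bounds, null, and place checking.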
Worry about single-place performance first
Our experience has been that programs that perform poorly in a single place also don't scale. So before worrying about scaling, first make sure the single-place version of the code is reasonably efficient.
Get good scaling for a small number of places before attempting larger runs
Quite a bit of scaling performance work can be done with only a few X10 places. In particular, runs with 2, 4, and 8 places are usually sufficient to identify non-scalable communication patterns, excessive data transfer, and excessive creation of remote references to heap-allocated objects (which become memory leaks due to the lack of distributed GC).
Scaling to a very large number of places
Scaling to hundreds or thousands of places requires more careful usage of X10 language features and more careful programming than scaling to a smaller number of places. For some examples of benchmarks that do scale, see the X10 version of the HPCC benchmarks and the UTS benchmark. They can be found in svn at http://x10.svn.sourceforge.net/viewvc/x10/benchmarks/trunk or as an optional download with each X10 release. These codes have been successfully run on large PowerPC clusters and on BlueGene systems. Notice that they very carefully avoid using many X10 language features, do not create very many remote references, and do not use clocks, whens, atomics, or futures. We are continually updating these codes to be "as nice as possible" within the current limitations of the X10 implementation, so spending some time understanding how this code works and why it is written the way it is will be very helpful if you want to scale other codes to large systems. As the X10 implementation matures, we expect it to become progressively easier to scale programs, but it currently requires care and some amount of persistence.
As a rule of thumb, so far most X10 programs that have scaled well to a large number of nodes have had the following control pattern:
- The main activity has a top-level finish under which it uses asyncs to create a top-level async at each place that runs for a long time. That top-level async may in turn use asyncs/at operations fairly freely to communicate asynchronously with the other places.
- Bulk data transfers of primitive (non-pointer containing) data should utilize the asyncCopy methods of IndexedMemoryChunk and Array.
- It may be possible to have nested local finishes, but large numbers of nested distributed finishes probably need to be avoided due to limitations in the current distributed finish implementation.
One way to describe this control pattern is as SPMD augmented with active messages.
Of course, one may be able to use more general control patterns on subsets of nodes within a larger computation. We have not deeply explored all of the possible combinations.
We are always interested in more sample X10 programs, especially ones that have been scaled up to run on large systems. If you have one, please consider contributing it back to the project!
Compiler and build options to maximize performance
The default compiler/build options are set to minimize compile time, not maximize performance!
To do performance evaluation of X10, you need to be sure to use the right set of flags. Using the default flags will result in significantly lower than peak performance because, as is standard with C++ compilers, the default set of options results in compiling with no optimizations.
There are two options you need to pass to either the x10c++ or x10c compiler for peak performance:
| Option | Semantics |
|---|---|
| -O | enable optimization |
| -NO_CHECKS | disable array bounds checking, null pointer checking, and place checking |
Depending on the kind of code you are running, -NO_CHECKS may or may not have significant performance impact. For array based code where the arrays have rank>1, -NO_CHECKS is currently critical for maximizing performance as the multi-dimensional array bounds checking code has not been designed for high performance.
You will also want to make sure that the X10 class libraries are compiled with the proper options. If you are using a pre-built binary release of X10, the standard library was compiled with -O, but not with -NO_CHECKS. For absolute peak performance, you would need to rebuild the class libraries with -NO_CHECKS. However, since the largest impact of -NO_CHECKS is on arrays, and the basic array functions are inlined into the application code with -O, you may be able to simply compile the application code with -O -NO_CHECKS and lose little performance compared to recompiling the standard libraries as well.
Here is how to build the X10 standard libraries from source for absolute peak performance.
    cd x10.dist
    ant distclean; ant dist -Doptimize=true -DNO_CHECKS=true
And invoke x10c++ to compile your application like:
    x10c++ -O -NO_CHECKS <....rest of command line...>

Properly configuring the X10 Runtime
The X10 runtime internally executes asyncs by scheduling them on a pool of worker threads within each place. By default, the X10 runtime creates only a single worker thread per place. To exploit multiple cores within a place, you must set the X10_NTHREADS environment variable to the desired number of worker threads. A good rule of thumb is to create one X10 worker thread per available core. For example, suppose you want to run an X10 program with two places on a machine with 8 cores and want the program to use all the available cores. Then you should set X10_NTHREADS=4 (2 places x 4 threads per place = 8 active cores). The X10 runtime will endeavor to keep the number of active worker threads in its pool at the requested value by dynamically adding and removing threads as needed. For more details and related issues, please see the Runtime section of the Performance Model paper.
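The one-thread-per-core rule of thumb is simple integer arithmetic, sketched below as a small shell helper. The function name `threads_per_place` is hypothetical, and the core and place counts are passed in by hand rather than detected. Only the `X10_NTHREADS` variable itself comes from this page.

```shell
#!/bin/sh
# Sketch: derive X10_NTHREADS from the core count of a machine and the
# number of places sharing it. threads_per_place is a hypothetical helper.

threads_per_place() {
    cores=$1
    places=$2
    n=$((cores / places))      # integer division: cores shared evenly among places
    if [ "$n" -lt 1 ]; then
        n=1                    # never request fewer than one worker thread
    fi
    echo "$n"
}

# The example from the text: 2 places on an 8-core machine.
X10_NTHREADS=$(threads_per_place 8 2)
export X10_NTHREADS
echo "X10_NTHREADS=$X10_NTHREADS"   # prints X10_NTHREADS=4
```

When the core count is not an exact multiple of the place count, the division rounds down, leaving some cores idle rather than oversubscribing them.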
Selecting the right X10RT Implementation
The sockets implementation of X10RT is supported on all platforms, but multi-place programs using it may not perform as well as with alternative transports, because sockets has higher latency and lower bandwidth. If it is available for your platform, use pgas_lapi instead of sockets. As a second choice, use the MPI-based implementation of X10RT.
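The preference order above (pgas_lapi, then MPI, then sockets) can be written down as a small selection helper. This is only a sketch: `choose_x10rt` is a hypothetical function, the caller supplies the transports known to be available on the platform, and the short name `mpi` for the MPI-based implementation is an assumption.

```shell
#!/bin/sh
# Sketch: pick an X10RT transport following the preference order in the text.
# choose_x10rt is a hypothetical helper; its arguments are the transports
# you know to be available. The short name "mpi" is an assumption.

choose_x10rt() {
    for preferred in pgas_lapi mpi; do
        for available in "$@"; do
            if [ "$available" = "$preferred" ]; then
                echo "$preferred"
                return 0
            fi
        done
    done
    echo "sockets"   # supported on all platforms, so it is the fallback
}

# Example: a platform offering sockets and MPI, but not pgas_lapi.
choose_x10rt sockets mpi
```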
For more details, see X10RT Implementations