A transpiler from CUDA to ISPC using clang libTooling. Some of the experiments that I had been doing can be found here. The tools was developed as a part of Google Summer of Code project under CERN with Daniel Hugo Cámpora Perez and Axel Naumann as mentors.
The presentations will give you a detailed description of the project, it's motivation and future directions.
- CUDA 9.0 - 9.2
- ISPC - use built in alloy.py script to install
- clang 9 - with libclang and llvm-tools
- CMake 3.5.0 or greater
There is a Dockerfile provided with the project to the ease installation process
mkdir build && cd build
cmake -G Ninja -DLLVM_DIR=path_to_llvm_cmake_dir ..
// CUDA
__global__ void reduce(int *a, int *partial_sum, int N) {
size_t tid = threadIdx.x;
size_t gid = threadIdx.x + blockIdx.x * blockDim.x;
for (size_t s = N / 2; s > 0; s >>= 1) {
if (tid < s) {
a[gid] += a[gid + s];
}
__syncthreads();
}
if (tid == 0)
partial_sum[blockIdx.x] = a[blockIdx.x * blockDim.x];
}
// ISPC
#define ISPC_GRID_START \
Dim3 blockIdx, threadIdx; \
for (blockIdx.z = 0; blockIdx.z < gridDim.z; blockIdx.z++) { \
for (blockIdx.y = 0; blockIdx.y < gridDim.y; blockIdx.y++) { \
for (blockIdx.x = 0; blockIdx.x < gridDim.x; blockIdx.x++) {
#define ISPC_BLOCK_START \
for (threadIdx.z = 0; threadIdx.z < blockDim.z; threadIdx.z++) { \
for (threadIdx.y = 0; threadIdx.y < blockDim.y; threadIdx.y++) { \
for (threadIdx.x = programIndex; threadIdx.x < blockDim.x; \
threadIdx.x += programCount) {
#define ISPC_KERNEL(function, ...) \
export void function( \
const uniform Dim3 &gridDim, const uniform Dim3 &blockDim, \
const uniform size_t &shared_memory_size, __VA_ARGS__)
struct Dim3 {
int x, y, z;
};
ISPC_KERNEL(reduce, uniform int a[], uniform int partial_sum[], uniform int N) {
ISPC_GRID_START
ISPC_BLOCK_START
const unsigned int64 tid = threadIdx.x;
const unsigned int64 gid = threadIdx.x + blockIdx.x * blockDim.x;
ISPC_BLOCK_END
for (size_t s = N / 2; s > 0; s >>= 1) {
ISPC_BLOCK_START
const unsigned int64 tid = threadIdx.x;
const unsigned int64 gid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < s) {
a[gid] += a[gid + s];
}
ISPC_BLOCK_END
}
ISPC_BLOCK_START
const unsigned int64 tid = threadIdx.x;
const unsigned int64 gid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid == 0) {
partial_sum[blockIdx.x] = a[blockIdx.x * blockDim.x];
}
ISPC_BLOCK_END
ISPC_GRID_END
}
Sphnix and doxygen documentation of the tool can be generated by passing docs/all
to the build system.
./spmdfy ../examples/transpose/transpose.cu -o transpose.ispc
Clang uses a compilation database to pass additional command line arguments. You can generate using cmake by passing CMAKE_EXPORT_COMPILE_COMMANDS
which will dump compile_commands.json
. If your codebase is compile nvcc, you can convert nvcc specific flags to clang's by running the tool here.
- Shared Memory - both dynamic and static
- Atomic Functions
- Syncthreads with complex control flow
- Some CUDA Math libraries
- Device Functions
- Python Tool to convert compilation database
- Dataflow analysis to detect partial nodes
- Inline of Device functions
- More C++ stuff - Convert C++ to C as ISPC is a C language.
List of tests that are currently working with the tool.
- CUDA_Features/Shared_Memory
- CUDA_Features/Atomic
- Finite Difference
- Transpose
- Saxpy
- Reduce