Investigate the use of Alpaka and CUPLA #325
@makortel has set up a standalone version of the pixel unpacker to simplify testing.
Here are some simple instructions for using CUPLA without CMake (which would likely be incompatible with CMSSW's scram).

Set up the environment:

```bash
BASE=$PWD
export CUDA_ROOT=/usr/local/cuda-10.0
export ALPAKA_ROOT=$BASE/alpaka
export CUPLA_ROOT=$BASE/cupla

CXX="/usr/bin/g++-7"
CXX_FLAGS="-m64 -std=c++11 -g -O2 -DALPAKA_DEBUG=0 -DCUPLA_STREAM_ASYNC_ENABLED=1 -I$CUDA_ROOT/include -I$ALPAKA_ROOT/include -I$CUPLA_ROOT/include"
HOST_FLAGS="-fPIC -ftemplate-depth-512 -Wall -Wextra -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-local-typedefs -Wno-attributes -Wno-reorder -Wno-sign-compare"
NVCC="$CUDA_ROOT/bin/nvcc"
NVCC_FLAGS="-ccbin $CXX -lineinfo --expt-extended-lambda --expt-relaxed-constexpr --generate-code arch=compute_50,code=sm_50 --use_fast_math --ftz=false --cudart shared"
```

Download alpaka and cupla:

```bash
git clone git@github.com:ComputationalRadiationPhysics/alpaka.git -b 0.3.5 $ALPAKA_ROOT
git clone git@github.com:ComputationalRadiationPhysics/cupla.git -b 0.1.1 $CUPLA_ROOT
```

Build cupla ...

... for the CUDA backend:

```bash
FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"
mkdir -p $CUPLA_ROOT/build/cuda $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/cuda
for FILE in $FILES; do
  $NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -Xcompiler "$HOST_FLAGS" -x cu -c $FILE -o $(basename $FILE).o
done
$NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -Xcompiler "$HOST_FLAGS" -shared *.o -o $CUPLA_ROOT/lib/libcupla-cuda.so
```

... for the serial backend:

```bash
FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"
mkdir -p $CUPLA_ROOT/build/serial $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/serial
for FILE in $FILES; do
  $CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -c $FILE -o $(basename $FILE).o
done
$CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -shared *.o -o $CUPLA_ROOT/lib/libcupla-serial.so
```

... for the TBB backend:

```bash
FILES="$CUPLA_ROOT/src/*.cpp $CUPLA_ROOT/src/manager/*.cpp"
mkdir -p $CUPLA_ROOT/build/tbb $CUPLA_ROOT/lib
cd $CUPLA_ROOT/build/tbb
for FILE in $FILES; do
  $CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -c $FILE -o $(basename $FILE).o
done
$CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $HOST_FLAGS -shared *.o -ltbbmalloc -ltbb -lpthread -lrt -o $CUPLA_ROOT/lib/libcupla-tbb.so
```

Build an example with cupla ...

... using CUDA on the GPU:

```bash
cd $BASE
$NVCC -DALPAKA_ACC_GPU_CUDA_ENABLED $CXX_FLAGS $NVCC_FLAGS -x cu $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o cuda-vectorAdd -L$CUPLA_ROOT/lib -lcupla-cuda
```

... using the serial backend on the CPU:

```bash
cd $BASE
$CXX -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED $CXX_FLAGS $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o serial-vectorAdd -L$CUPLA_ROOT/lib -lcupla-serial -lpthread
```

... using the TBB backend on the CPU:

```bash
cd $BASE
$CXX -DALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED $CXX_FLAGS $CUPLA_ROOT/example/CUDASamples/vectorAdd/src/vectorAdd.cpp -o tbb-vectorAdd -L$CUPLA_ROOT/lib -lcupla-tbb -lpthread
```
Here is a first reimplementation of the CUDA kernel with CUPLA, following the instructions in the porting guide.
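For orientation, this is roughly the pattern the cupla porting guide prescribes: the `__global__` kernel becomes a functor templated on the accelerator type, and the triple-chevron launch is replaced by the `CUPLA_KERNEL` macro. A minimal sketch using the vectorAdd example rather than the actual unpacker code (all names here are illustrative):

```cpp
#include <cuda_to_cupla.hpp>

// CUDA __global__ kernels become functors whose operator() is templated on
// the accelerator and takes it as the first argument (which must be named
// "acc"); threadIdx, blockIdx and blockDim keep working through cupla's
// CUDA compatibility layer.
struct vectorAddKernel {
    template <typename T_Acc>
    ALPAKA_FN_ACC void operator()(T_Acc const& acc,
                                  const float* a, const float* b, float* c,
                                  int n) const {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
};

// The CUDA launch
//     vectorAddKernel<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
// becomes
//     CUPLA_KERNEL(vectorAddKernel)(blocks, threadsPerBlock, 0, 0)(d_a, d_b, d_c, n);
```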
Here is an optimised version of the cupla implementation, following the discussion in the tuning guide and adding support for the TBB backend. With these changes, I get the following performance on my laptop (Intel Core i7-6700HQ 2.60GHz, NVIDIA GeForce GTX 960M):
On the CPU, the serial and (single-core) TBB versions are only 5–8% slower than the naïve implementation; the TBB implementation running on four cores is roughly 3 times faster. On the GPU side, cupla's CUDA backend does not seem to introduce any significant overhead over the native CUDA version.
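For context, the central idea in the tuning guide is an extra "element" layer inside the kernel: each alpaka thread processes several consecutive data items, so CPU backends that map a whole block to a single thread still get a tight inner loop. A hedged sketch of that pattern, again on the illustrative vectorAdd kernel rather than the real unpacker:

```cpp
#include <cuda_to_cupla.hpp>

struct vectorAddKernel {
    template <typename T_Acc>
    ALPAKA_FN_ACC void operator()(T_Acc const& acc,
                                  const float* a, const float* b, float* c,
                                  int n) const {
        // elemDim.x is the number of elements per thread: 1 on the GPU
        // backends, larger on the CPU backends, so the same source stays
        // efficient on both.
        int first = (blockIdx.x * blockDim.x + threadIdx.x) * elemDim.x;
        for (int i = first; i < first + (int) elemDim.x && i < n; ++i)
            c[i] = a[i] + b[i];
    }
};

// CUPLA_KERNEL_OPTI picks a backend-appropriate thread/element split from
// the same launch parameters:
//     CUPLA_KERNEL_OPTI(vectorAddKernel)(blocks, threadsOrElements, 0, 0)(d_a, d_b, d_c, n);
```

With the element loop in place, the same source builds with any of the backend defines used in the instructions above.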
On the other hand, we are primarily looking at making the kernel code portable (right?), so the EDProducer classes could still call the kernels as before (although it would be an interesting exercise to think through the consequences of making the EDProducers, and hence(?) the EDProducts, "portable" as well).
Good point... I will ask what, exactly, is not thread safe.
Some more comments about Alpaka and Cupla.

Alpaka

Cupla
We should look into Alpaka, and possibly CUPLA, as a way of semi-automatically porting the CUDA kernels back to run on the CPU (and potentially on other accelerators in the future).
Some links: