GPU Load balancing Abstraction for Irregular Computations
An open-source GPU load-balancing framework for applications that exhibit irregular parallelism. The applications and algorithms we consider are fundamental to computing tasks ranging from sparse machine learning and large numerical simulations to graph analytics. The underlying data and data structures that drive these applications exhibit access patterns that do not map well to a GPU architecture designed around dense, regular access patterns. Prior to the work we present here, the only way to unleash the GPU's full power on these problems has been through tightly coupled load-balancing techniques. Our proposed load-balancing abstraction decouples load balancing from work processing, and aims to support both static and dynamic schedules with a programmable interface for implementing new load-balancing schedules in the future. With our open-source framework, we hope to improve programmers' productivity when developing irregular-parallel algorithms on the GPU, and to improve the overall performance of such applications by providing a quick path to experimentation with a variety of existing load-balancing techniques. We also hope that separating the concerns of load balancing and work processing within our abstraction makes it easier to manage existing code and extend it to future architectures.
We provide two APIs for our library, one that focuses on a beginner-friendly approach to load-balancing irregular sparse computations and another that allows advanced programmers to retain control of the GPU kernels and express load-balanced execution as ranged loops. Both approaches are highlighted below.
Our load-balanced execution API builds on the approach defined in gunrock/essentials, where we identify key primitives used in computing sparse linear algebra, graph analytics, and other irregular computations alike. We then implement load-balanced versions of these primitives, such that the user gets access to the atom, tile, and processor id they are working on through the C++ lambda signature.
Users define their computation within the C++ lambda, which the load-balanced primitive calls for every instance of a work atom.
In this simple example we use the Compressed Sparse Row (CSR) format and simply return the number of atoms (nonzeros) in each row as our layout.
auto layout = [=] __device__ (std::size_t tile_id) {
  return offsets[tile_id + 1] - offsets[tile_id];
};
// user-defined compute: y = Ax
auto spmv = [=] __device__ (std::size_t atom_id, std::size_t tile_id, std::size_t proc_id) {
  return values[atom_id] * x[column_indices[atom_id]];
};
The primitive requires the load-balancing schedule (work_oriented in this example) as a template parameter, the transformation as a C++ lambda expression (spmv), and the layout as an input to perform a segmented reduction. The output of the C++ lambda expression gets reduced by segments defined using A.offsets.
lb::transform_segreduce<lb::work_oriented>
(spmv, layout, A.nonzeros, A.offsets,
G.rows, y, lb::plus_t(),
0.0f, stream);
Pros:
- Requires no knowledge of how to implement segmented reduction.
- Very simple API if the computation can be defined using C++ lambda expressions.

Cons:
- No control over kernel execution and dispatch configuration.
- No composability; cannot implement more complicated computations that may have cooperative properties among processors.
SpMV problem-specific kernel parameters.
template <std::size_t threads_per_block,
typename index_t,
typename offset_t,
typename type_t>
__global__ void __launch_bounds__(threads_per_block, 2)
spmv(std::size_t rows,
std::size_t cols,
std::size_t nnz,
offset_t* offsets,
index_t* indices,
const type_t* values,
const type_t* x,
type_t* y) {
Allocates any temporary memory required for load balancing and constructs a schedule per processor's partition (defined using cooperative groups).
  using setup_t = schedule::setup<schedule::algorithms_t::tile_mapped,
                                  threads_per_block, 32, index_t, offset_t>;
/// Allocate temporary storage for the schedule.
using storage_t = typename setup_t::storage_t;
__shared__ storage_t temporary_storage;
/// Construct the schedule.
setup_t config(temporary_storage, offsets, rows, nnz);
auto p = config.partition();
(2) Load-balanced Ranged Loops (see: C++ ranges)
In this example, we define two iteration spaces: virtual and real. Virtual spaces allow us to balance atoms and tiles onto processor ids and link directly to the real iteration space, which returns the exact atom or tile being processed. The code below loops over the balanced number of atoms, fetches the tile corresponding to the atom being processed, and allows the user to define their computation.
for (auto virtual_atom : config.atom_accessor(p)) {
auto virtual_tile = config.tile_accessor(virtual_atom, p);
if (!(config.is_valid_accessor(virtual_tile, p)))
continue;
auto row = config.tile_id(virtual_tile, p);
auto nz_idx = config.atom_id(virtual_atom, row, virtual_tile, p);
Once the user has access to the atom, tile, and processor id, they implement the desired computation on the given tuple. In this example, we use a simple atomicAdd to perform SpMV (this can be improved).
atomicAdd(&(y[row]), values[nz_idx] * x[indices[nz_idx]]);
}
}
template <typename index_t,
typename offset_t,
typename type_t>
__global__ void spmv(const std::size_t rows,
const std::size_t cols,
const std::size_t nnz,
const offset_t* offsets,
const index_t* indices,
const type_t* values,
const type_t* x,
type_t* y) {
/// Equivalent to:
/// row = blockIdx.x * blockDim.x + threadIdx.x; (init)
/// row < rows; (boundary condition)
/// row += gridDim.x * blockDim.x. (step)
for (auto row : config.tiles()) {
type_t sum = 0;
/// Equivalent to:
/// for (offset_t nz = offset; nz < end; ++nz)
for (auto nz : config.atoms(row)) {
sum += values[nz] * x[indices[nz]];
}
// Output
y[row] = sum;
}
}
Essentials © 2022 The Regents of the University of California