Add sparse matrix to mp #860

Open · wants to merge 68 commits into main

Changes from 66 commits

Commits (68)
00f1e39
Add initial implementation of sparse matrix in mp
Xewar313 Aug 14, 2024
36d7f38
Fixed row shape calculation
Xewar313 Aug 14, 2024
dd57bb1
Extract matrix format from matrix implementation
Xewar313 Aug 19, 2024
b84ecc0
Add initial gemv implementation
Xewar313 Aug 21, 2024
1c1dad7
Move matrix related files from sp to general module
Xewar313 Aug 21, 2024
2456379
Separated matrix format from mp sparse matrix implementation and adde…
Xewar313 Aug 27, 2024
bd63c2c
Improve matrix loading performance
Xewar313 Aug 28, 2024
b75b7ed
Add sycl support to mp sparse matrices
Xewar313 Sep 3, 2024
756fab1
Added initialization from one node in mp sparse matrix
Xewar313 Sep 4, 2024
0b498ad
Add concept requirement for gemv operation
Xewar313 Sep 4, 2024
bbf2acf
Initial improvement to matrix reading
Xewar313 Sep 4, 2024
9eca244
Add small improvements to matrix loading
Xewar313 Sep 9, 2024
7b55a1b
Fix formatting
Xewar313 Sep 9, 2024
bb2e02e
Add sparse benchmark and broadcasted vector
Xewar313 Sep 17, 2024
18165da
Add benchmarking tools
Xewar313 Sep 17, 2024
94f818e
Add gemv benchmark to gbench
Xewar313 Sep 18, 2024
982a0e0
Add reference gemv implementation
Xewar313 Sep 24, 2024
47a8455
Fixed gemv reference
Xewar313 Sep 24, 2024
a97a97b
Fixed gemv benchmark implementation
Xewar313 Sep 25, 2024
231a09a
Fix band csr generation
Xewar313 Sep 30, 2024
6b8af49
Add support for slim matrix multiplication
Xewar313 Oct 1, 2024
628aa07
Fix benchmark and band csr generation
Xewar313 Oct 1, 2024
2047ecf
Merge branch 'benchmark' of github.com:Xewar313/distributed-ranges in…
Xewar313 Oct 1, 2024
4f12327
Add support to device based computing in distributed sparse matrix
Xewar313 Oct 2, 2024
71bd336
add broadcasted slim matrix device memory support
Xewar313 Oct 2, 2024
6f96929
Fix issue with inconsistent timing when using mp gemv
Xewar313 Oct 7, 2024
08a2247
Some fixes to sparse matrices
Xewar313 Oct 8, 2024
6a4bd30
improve work division in csr eq distribution
Xewar313 Oct 9, 2024
f93961b
Add better work distribution to csr_row_distribution and fix distrib…
Xewar313 Oct 16, 2024
e421523
improve performance on less dense matrices and allow broadcasting big…
Xewar313 Oct 18, 2024
3fa4a68
Reversed change to eq distribution
Xewar313 Oct 18, 2024
0a1a4dc
update some examples and benchmarks
Xewar313 Oct 25, 2024
2beec18
Improved communication in eq distribution
Xewar313 Nov 4, 2024
5edd0ba
Improve eq format on very sparse matrices
Xewar313 Nov 5, 2024
5800f99
Merge branch 'main' into benchmark
Xewar313 Nov 5, 2024
cae67ef
Fix test compilation
Xewar313 Nov 5, 2024
28519e0
Reformat changes in mp matrix #1
Xewar313 Nov 5, 2024
f2c2fbe
Fix and improve tests
Xewar313 Nov 8, 2024
3a00d61
Add tests for sparse gemm in mp
Xewar313 Nov 8, 2024
2ec4b21
Reformat changes in mp matrix #2
Xewar313 Nov 8, 2024
05f5c63
Fix compilation on borealis
Xewar313 Nov 12, 2024
06a6628
fix compilation
Xewar313 Nov 13, 2024
aa706f7
Fix issues with very small and very big matrices
Xewar313 Nov 13, 2024
6e0e9d2
Merge branch 'main' into benchmark
Xewar313 Nov 13, 2024
8f1a2b7
Fix compilation on older OneDpl
Xewar313 Nov 13, 2024
28e023e
Fix style
Xewar313 Nov 13, 2024
e4ee8c7
Merge onedpl fix
Xewar313 Nov 13, 2024
55185dc
Some fixes with versions
Xewar313 Nov 13, 2024
b7704ea
Add local to csr_eq_segment
Xewar313 Nov 20, 2024
4acbad6
Add proper local method
Xewar313 Nov 22, 2024
1f84ba7
Add problem to review
Xewar313 Nov 25, 2024
ba20ee3
Moved local view to distribution
Xewar313 Nov 25, 2024
bad5606
Add new example of not working code
Xewar313 Nov 25, 2024
8e7f1fe
Fix issue with lambda copy
Xewar313 Nov 27, 2024
3503271
Make local work with shared memory
Xewar313 Nov 27, 2024
44a6e78
Fix device memory when using local in row distribution
Xewar313 Nov 29, 2024
2bf503e
Fix local in eq distribution
Xewar313 Dec 2, 2024
dc89bc8
Fix formatting
Xewar313 Dec 2, 2024
7e7f2d2
Reverse change in dr::transform_view
Xewar313 Dec 2, 2024
dd1d6ed
Fix benchmark when default vector size is small
Xewar313 Dec 2, 2024
e42cfa2
Fix issue when distributed vector is too small
Xewar313 Dec 3, 2024
2318a46
Improve performance of eq distribution gather
Xewar313 Dec 3, 2024
4cfb110
Remove unnecessary comment
Xewar313 Dec 3, 2024
04191d7
Add test for reduce and fix type error in sparse matrix local
Xewar313 Dec 9, 2024
adad4f7
Add broadcast_vector tests
Xewar313 Dec 9, 2024
f17243b
Fix formatting
Xewar313 Dec 9, 2024
f1639b0
Corrected gemv matrix creation
Xewar313 Jan 13, 2025
3dfdac0
Fix formatting
Xewar313 Jan 13, 2025
3 changes: 2 additions & 1 deletion benchmarks/gbench/mp/CMakeLists.txt
@@ -15,6 +15,7 @@ add_executable(
../common/stream.cpp
streammp.cpp
rooted.cpp
gemv.cpp
stencil_1d.cpp
stencil_2d.cpp
chunk.cpp
@@ -41,7 +42,7 @@ endif()
# mp-quick-bench is for development. By reducing the number of source files, it
# builds much faster. Change the source files to match what you need to test. It
# is OK to commit changes to the source file list.
add_executable(mp-quick-bench mp-bench.cpp ../common/distributed_vector.cpp)
add_executable(mp-quick-bench mp-bench.cpp gemv.cpp)

foreach(mp-bench-exec IN ITEMS mp-bench mp-quick-bench)
target_compile_definitions(${mp-bench-exec} PRIVATE BENCH_MP)
4 changes: 4 additions & 0 deletions benchmarks/gbench/mp/fft3d.cpp
@@ -5,7 +5,11 @@
#include "cxxopts.hpp"
#include "fmt/core.h"
#include "mpi.h"
#if (__INTEL_LLVM_COMPILER >= 20250000)
#include "oneapi/mkl/dft.hpp"
#else
#include "oneapi/mkl/dfti.hpp"
#endif
#include <complex>

#include "dr/mp.hpp"
318 changes: 318 additions & 0 deletions benchmarks/gbench/mp/gemv.cpp
@@ -0,0 +1,318 @@
// SPDX-FileCopyrightText: Intel Corporation
//
// SPDX-License-Identifier: BSD-3-Clause

#include "mpi.h"

#include "dr/mp.hpp"
#include <filesystem>
#include <fmt/core.h>
#include <fstream>
#include <random>
#include <sstream>

#ifdef STANDALONE_BENCHMARK

MPI_Comm comm;
int comm_rank;
int comm_size;

#else

#include "../common/dr_bench.hpp"

#endif

namespace mp = dr::mp;

#ifdef STANDALONE_BENCHMARK
Contributor:

I don't see any CMakeLists.txt building gemv.cpp with the STANDALONE_BENCHMARK option, so the code under this ifdef is never compiled. Please remove it if you don't need it, or add a target to the CMakeLists so it gets compiled, and make sure it is built during CI (to catch compile-time errors in CI).

Contributor Author:

Fair point; I removed it, since it is not strictly necessary for benchmarking.

int main(int argc, char **argv) {

MPI_Init(&argc, &argv);
comm = MPI_COMM_WORLD;
MPI_Comm_rank(comm, &comm_rank);
MPI_Comm_size(comm, &comm_size);

if (argc != 3 && argc != 5) {
fmt::print(
"usage: ./sparse_benchmark [test outcome dir] [matrix market file], or "
"./sparse_benchmark [test outcome dir] [number of rows] [number of "
"columns] [number of lower bands] [number of upper bands]\n");
return 1;
}

#ifdef SYCL_LANGUAGE_VERSION
sycl::queue q = dr::mp::select_queue();
mp::init(q);
#else
mp::init();
#endif
dr::views::csr_matrix_view<double, long> local_data;
std::stringstream filenamestream;
auto root = 0;
auto computeSize = dr::mp::default_comm().size();
if (root == dr::mp::default_comm().rank()) {
if (argc == 5) {
fmt::print("started loading\n");
auto n = std::stoul(argv[2]);
auto up = std::stoul(argv[3]);
auto down = std::stoul(argv[4]);
local_data = dr::generate_band_csr<double, long>(n, up, down);
filenamestream << "mp_band_" << computeSize << "_" << n << "_"
<< up + down << "_" << local_data.size();
fmt::print("finished loading\n");
} else {
fmt::print("started loading\n");
std::string fname(argv[2]);
std::filesystem::path p(argv[2]);
local_data = dr::read_csr<double, long>(fname);
filenamestream << "mp_" << p.stem().string() << "_" << computeSize << "_"
<< local_data.size();
fmt::print("finished loading\n");
}
}
std::string resname;
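// Construct the same matrix under both distribution strategies so the
// timing loops below can compare csr_eq_distribution with
// csr_row_distribution.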
mp::distributed_sparse_matrix<
double, long, dr::mp::MpiBackend,
dr::mp::csr_eq_distribution<double, long, dr::mp::MpiBackend>>
m_eq(local_data, root);
mp::distributed_sparse_matrix<
double, long, dr::mp::MpiBackend,
dr::mp::csr_row_distribution<double, long, dr::mp::MpiBackend>>
m_row(local_data, root);
fmt::print("finished distribution\n");
std::vector<double> eq_duration;
std::vector<double> row_duration;

auto N = 10;
std::vector<double> b;
b.reserve(m_row.shape().second);
std::vector<double> res(m_row.shape().first);
for (auto i = 0; i < m_row.shape().second; i++) {
b.push_back(i);
}

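// Replicate the dense vector b on every rank (broadcast from root rank 0).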
dr::mp::broadcasted_vector<double> allocated_b;
allocated_b.broadcast_data(m_row.shape().second, 0, b,
dr::mp::default_comm());

fmt::print("started initial gemv distribution\n");
gemv(0, res, m_eq, allocated_b); // warm-up run to prepare SYCL before timing

fmt::print("finished initial gemv distribution\n");
for (auto i = 0; i < N; i++) {
auto begin = std::chrono::high_resolution_clock::now();
gemv(0, res, m_eq, allocated_b);
auto end = std::chrono::high_resolution_clock::now();
double duration = std::chrono::duration<double>(end - begin).count() * 1000;
eq_duration.push_back(duration);
}

gemv(0, res, m_row, allocated_b); // warm-up run to prepare SYCL before timing
for (auto i = 0; i < N; i++) {
auto begin = std::chrono::high_resolution_clock::now();
gemv(0, res, m_row, allocated_b);
auto end = std::chrono::high_resolution_clock::now();
double duration = std::chrono::duration<double>(end - begin).count() * 1000;
row_duration.push_back(duration);
}

if (root == dr::mp::default_comm().rank()) {
std::string tmp;
filenamestream >> tmp;
std::filesystem::path p(argv[1]);
p += tmp;
p += ".csv";
std::ofstream write_stream(p.string());
write_stream << eq_duration.front();
for (auto i = 1; i < N; i++) {
write_stream << "," << eq_duration[i];
}
write_stream << "\n";
write_stream << row_duration.front();
for (auto i = 1; i < N; i++) {
write_stream << "," << row_duration[i];
}
write_stream << "\n";
}
allocated_b.destroy_data();
mp::finalize();
}

#else

namespace {
std::size_t getWidth() {
return 8; // default_vector_size / 100000;
}
} // namespace
static auto getMatrix() {
std::size_t n = std::max(1., std::sqrt(default_vector_size / 100000)) * 50000;
Contributor:

default_vector_size is meant to be roughly equal to the size of the data being allocated. However, when I run all benchmarks on the dnp02 host, the only tests that fail with bad_alloc are Gemv_DR.

Please add an explanation to the code of what the 100000 and 50000 constants are for.

Please also make the benchmark allocate roughly the same amount of data as the other benchmarks, so that this command:

ONEAPI_DEVICE_SELECTOR='level_zero:gpu' I_MPI_OFFLOAD=1 \
I_MPI_OFFLOAD_CELL_LIST=0-11 I_MPI_HYDRA_BOOTSTRAP=ssh \
mpiexec -n 2 -ppn 2  build/benchmarks/gbench/mp/mp-bench --vector-size 1000000000 \
--reps 50 --v=3 --benchmark_filter=GemvEq_DR/ --sycl

will not fail on kdse-pre-dnp-02.

Contributor Author:

I corrected the sizes of the matrix, so that the data size is actually similar (not exactly the same, but very close).

The line with 100000 and 50000 was commented out, and I added a comment noting that that matrix size is suitable for weak-scaling testing with dr-bench.

The command should now also work on kdse-pre-dnp-02.
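For context, a minimal sketch of the sizing logic being discussed (illustrative only, not code from this PR; the helper name and the bytes-per-row cost model are assumptions): pick the band-matrix dimension so that the CSR arrays occupy roughly as many bytes as default_vector_size doubles.

#include <cstddef>

// Hypothetical helper: each nonzero stores a double value (8 B) plus a long
// column index (8 B), and each row adds one long row-pointer entry (8 B).
std::size_t band_rows_for(std::size_t target_elems, std::size_t bands) {
  std::size_t target_bytes = target_elems * sizeof(double);
  std::size_t bytes_per_row =
      bands * (sizeof(double) + sizeof(long)) + sizeof(long);
  return target_bytes / bytes_per_row;
}

Matching total bytes this way, rather than matching the row count, keeps the Gemv_DR footprint close to what the other benchmarks allocate.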

Contributor Author:

One important note: the benchmark may sometimes freeze. This is due to gtest deciding on some hosts to stop iterating the benchmark state, and not on others. I am not sure why it decides that, but in that case the test needs to be re-run.

// std::size_t n = default_vector_size / 2;
std::size_t up = n / 50;
std::size_t down = n / 50;
// assert(dr::mp::use_sycl());
// assert(dr::mp::sycl_mem_kind() == sycl::usm::alloc::device);
return dr::generate_band_csr<double, long>(n, up, down);

// return dr::read_csr<double,
// long>("/home/komarmik/examples/soc-LiveJournal1.mtx"); return
// dr::read_csr<double, long>("/home/komarmik/examples/mycielskian18.mtx");
// return dr::read_csr<double,
// long>("/home/komarmik/examples/mawi_201512020030.mtx");
}

static void GemvEq_DR(benchmark::State &state) {
auto local_data = getMatrix();

mp::distributed_sparse_matrix<
double, long, dr::mp::MpiBackend,
dr::mp::csr_eq_distribution<double, long, dr::mp::MpiBackend>>
m(local_data, 0);
auto n = m.shape()[1];
auto width = getWidth();
std::vector<double> base_a(n * width);
for (int j = 0; j < width; j++) {
for (int i = 0; i < n; i++) {
base_a[i + j * n] = i * j + 1;
}
}
dr::mp::broadcasted_slim_matrix<double> allocated_a;
allocated_a.broadcast_data(n, width, 0, base_a, dr::mp::default_comm());

std::vector<double> res(m.shape().first * width);
gemv(0, res, m, allocated_a);
for (auto _ : state) {
gemv(0, res, m, allocated_a);
}
}

DR_BENCHMARK(GemvEq_DR);

static void GemvRow_DR(benchmark::State &state) {
auto local_data = getMatrix();

mp::distributed_sparse_matrix<
double, long, dr::mp::MpiBackend,
dr::mp::csr_row_distribution<double, long, dr::mp::MpiBackend>>
m(local_data, 0);
auto n = m.shape()[1];
auto width = getWidth();
std::vector<double> base_a(n * width);
for (int j = 0; j < width; j++) {
for (int i = 0; i < n; i++) {
base_a[i + j * n] = i * j + 1;
}
}
dr::mp::broadcasted_slim_matrix<double> allocated_a;
allocated_a.broadcast_data(n, width, 0, base_a, dr::mp::default_comm());

std::vector<double> res(m.shape().first * width);
gemv(0, res, m, allocated_a);
for (auto _ : state) {
gemv(0, res, m, allocated_a);
}
}

DR_BENCHMARK(GemvRow_DR);

static void Gemv_Reference(benchmark::State &state) {
auto local_data = getMatrix();
auto nnz_count = local_data.size();
auto band_shape = local_data.shape();
auto q = get_queue();
auto policy = oneapi::dpl::execution::make_device_policy(q);
auto val_ptr = sycl::malloc_device<double>(nnz_count, q);
auto col_ptr = sycl::malloc_device<long>(nnz_count, q);
auto row_ptr = sycl::malloc_device<long>((band_shape[0] + 1), q);
std::vector<double> b;
auto width = getWidth();
for (auto i = 0; i < band_shape[1] * width; i++) {
b.push_back(i);
}
double *elems = new double[band_shape[0] * width];
auto input = sycl::malloc_device<double>(band_shape[1] * width, q);
auto output = sycl::malloc_device<double>(band_shape[0] * width, q);
q.memcpy(val_ptr, local_data.values_data(), nnz_count * sizeof(double))
.wait();
q.memcpy(col_ptr, local_data.colind_data(), nnz_count * sizeof(long)).wait();
q.memcpy(row_ptr, local_data.rowptr_data(),
(band_shape[0] + 1) * sizeof(long))
.wait();
q.fill(output, 0.0, band_shape[0] * width).wait();
std::copy(policy, b.begin(), b.end(), input);

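// The nd_range below launches width * band_shape[0] work-groups of size wg;
// shrink wg until the global size (width * band_shape[0] * wg) fits in
// INT_MAX.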
auto wg = 32;
while (width * band_shape[0] * wg > INT_MAX) {
wg /= 2;
}
assert(wg > 0);

for (auto _ : state) {
if (dr::mp::use_sycl()) {
dr::mp::sycl_queue()
.submit([&](auto &&h) {
h.parallel_for(
sycl::nd_range<1>(width * band_shape[0] * wg, wg),
[=](auto item) {
auto input_j = item.get_group(0) / band_shape[0];
auto idx = item.get_group(0) % band_shape[0];
auto local_id = item.get_local_id();
auto group_size = item.get_local_range(0);
double sum = 0;
auto start = row_ptr[idx];
auto end = row_ptr[idx + 1];
for (auto i = start + local_id; i < end; i += group_size) {
auto colNum = col_ptr[i];
auto vectorVal = input[colNum + input_j * band_shape[1]];
auto matrixVal = val_ptr[i];
sum += matrixVal * vectorVal;
}
sycl::atomic_ref<double, sycl::memory_order::relaxed,
sycl::memory_scope::device>
c_ref(output[idx + band_shape[0] * input_j]);
c_ref += sum;
});
})
.wait();
q.memcpy(elems, output, band_shape[0] * sizeof(double) * width).wait();
} else {
std::fill(elems, elems + band_shape[0] * width, 0);
auto local_rows = local_data.rowptr_data();
auto row_i = 0;
auto current_row_position = local_rows[1];

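// CPU fallback: stream over all nonzeros once, advancing row_i whenever i
// crosses the next row-pointer boundary, and accumulate into each of the
// width output columns.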
for (int i = 0; i < nnz_count; i++) {
while (row_i + 1 < band_shape[0] && i >= current_row_position) {
row_i++;
current_row_position = local_rows[row_i + 1];
}
for (auto j = 0; j < width; j++) {
auto item_id = row_i + j * band_shape[0];
auto val_index = local_data.colind_data()[i] + j * band_shape[1];
auto value = b[val_index];
auto matrix_value = local_data.values_data()[i];
elems[item_id] += matrix_value * value;
}
}
}
}
delete[] elems;
sycl::free(val_ptr, q);
sycl::free(col_ptr, q);
sycl::free(row_ptr, q);
sycl::free(input, q);
sycl::free(output, q);
}

static void GemvEq_Reference(benchmark::State &state) { Gemv_Reference(state); }

static void GemvRow_Reference(benchmark::State &state) {
Gemv_Reference(state);
}

DR_BENCHMARK(GemvEq_Reference);

DR_BENCHMARK(GemvRow_Reference);

#endif
4 changes: 4 additions & 0 deletions benchmarks/gbench/sp/fft3d.cpp
@@ -3,7 +3,11 @@
// SPDX-License-Identifier: BSD-3-Clause

#include "cxxopts.hpp"
#if (__INTEL_LLVM_COMPILER >= 20250000)
#include "oneapi/mkl/dft.hpp"
#else
#include "oneapi/mkl/dfti.hpp"
#endif
#include <complex>
#include <dr/sp.hpp>
#include <fmt/core.h>
8 changes: 8 additions & 0 deletions examples/mp/CMakeLists.txt
@@ -22,10 +22,18 @@ function(add_mp_example example_name)
add_mp_ctest(TEST_NAME ${example_name} NAME ${example_name} NPROC 2)
endfunction()

function(add_mp_example_no_test example_name)
add_executable(${example_name} ${example_name}.cpp)
target_link_libraries(${example_name} cxxopts DR::mpi)
endfunction()

add_mp_example(stencil-1d)
add_mp_example(stencil-1d-array)
add_mp_example(stencil-1d-pointer)
add_mp_example(hello_world)
add_mp_example_no_test(sparse_matrix)
add_mp_example_no_test(sparse_benchmark)
add_mp_example_no_test(sparse_matrix_matrix_mul)

if(OpenMP_FOUND)
add_executable(vector-add-ref vector-add-ref.cpp)