Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor memory ownership svm #6073

Open
wants to merge 5 commits into
base: branch-24.10
Choose a base branch
from

Conversation

mfoerste4
Copy link
Contributor

@mfoerste4 mfoerste4 commented Sep 20, 2024

As discussed in #6057 this is a PR to refactor the raw memory allocations in the svm model which are passed back to the caller.

I have removed all raw allocation pointers from the SVM model and replaced them by device_buffers. Since ownership is more restrictive now we do have additional data copies when:

  • passing a model into the C-API
  • retrieving a model via the C-API
  • un-pickling/deserializing a model in python (removed copy with latest commit)

CC @tfeher, @divyegala

@github-actions github-actions bot added Cython / Python Cython or Python issue CUDA/C++ labels Sep 20, 2024
@mfoerste4 mfoerste4 self-assigned this Sep 23, 2024
@mfoerste4 mfoerste4 added Tech Debt Issues related to debt improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 23, 2024
@mfoerste4 mfoerste4 marked this pull request as ready for review September 23, 2024 20:16
@mfoerste4 mfoerste4 requested review from a team as code owners September 23, 2024 20:16
@mfoerste4 mfoerste4 changed the title [DRAFT] refactor memory ownership svm refactor memory ownership svm Sep 23, 2024
Copy link
Member

@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the C and Python APIs need a copy and ultimately, this is because rmm::device_buffer in this structure as well as the model is owned by the struct/class. Instead, if we use std::weak_ptr<rmm::device_buffer>> or plain rmm::device_buffer& then we can avoid copies and ownership issues by letting the C and Python API own the data but still allowing the C++ API to resize it.

The C API can also release the shared_ptr/device_buffer to transfer ownership to the user, thus removing the need for a copy there.

The Python API is owned by us and so, we can create shared_ptr/device_buffer in the Cython layer and store them as class members.

cpp/src/svm/results.cuh Outdated Show resolved Hide resolved
cpp/src/svm/smosolver.cuh Outdated Show resolved Hide resolved
Comment on lines 73 to 76
rmm::device_uvector<math_t> unique_labels(0, stream);
model.n_classes = raft::label::getUniquelabels(unique_labels, labels, n_rows, stream);
rmm::device_async_resource_ref rmm_alloc = rmm::mr::get_current_device_resource();
model.unique_labels = (math_t*)rmm_alloc.allocate_async(
model.n_classes * sizeof(math_t), rmm::CUDA_ALLOCATION_ALIGNMENT, stream);
raft::copy(model.unique_labels, unique_labels.data(), model.n_classes, stream);
model.unique_labels.resize(model.n_classes * sizeof(math_t), stream);
raft::copy((math_t*)model.unique_labels.data(), unique_labels.data(), model.n_classes, stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is unique_labels being re-used elsewhere or can this copy be avoided by directly passing modeul.unique_labels to getUniquelabels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is only used locally. However, I cannot pass a plain device_buffer into raft::label::getUniquelabels as it only takes device_uvector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use unique_labels.release() to get an r-value device_buffer and avoid the copy https://github.com/rapidsai/rmm/blob/b51447393c523cc929608d84850c70a3eae08af3/include/rmm/device_uvector.hpp#L414


// Unfortunately we need runtime support for both types
raft::device_matrix_view<math_t, int, raft::layout_stride> dense_support_matrix_view;
if (is_dense_support) {
dense_support_matrix_view =
raft::make_device_strided_matrix_view<math_t, int, raft::layout_f_contiguous>(
model.support_matrix.data, model.n_support, n_cols, 0);
(math_t*)model.support_matrix.data.data(), model.n_support, n_cols, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast

Copy link
Contributor Author

@mfoerste4 mfoerste4 Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have switched to reinterpret_cast in many places now. Here in this context I cannot easily switch as the model is const. To my understanding resolving this here would require subsequent const_casts or worse.

n_cols,
model.support_matrix.nnz)
? raft::make_device_compressed_structure_view<int, int, int>(
(int*)model.support_matrix.indptr.data(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

? raft::make_device_csr_matrix_view<math_t, int, int, int>(model.support_matrix.data,
csr_structure_view)
? raft::make_device_csr_matrix_view<math_t, int, int, int>(
(math_t*)model.support_matrix.data.data(), csr_structure_view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@@ -278,7 +277,7 @@ void svcPredictX(const raft::handle_t& handle,
&one,
K.data(),
transpose_kernel ? model.n_support : n_batch,
model.dual_coefs,
(math_t*)model.dual_coefs.data(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@@ -287,7 +286,7 @@ void svcPredictX(const raft::handle_t& handle,

} // end of loop

math_t* labels = model.unique_labels;
math_t* labels = (math_t*)model.unique_labels.data();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reinterpret_cast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

}
cudaStream_t stream = handle.get_stream();

// Note that the underlying allocations are not *freed* but rather reset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is the case with resize(). Why not deallocate_async() instead to free memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deallocate_async() is a private method. I added additional calls to shrink_to_fit() to ensure memory is freed right away.

cpp/src/svm/svm_api.cpp Outdated Show resolved Hide resolved
@mfoerste4
Copy link
Contributor Author

Thanks @divyegala for the rewiew. I have adressed or commented on your immediate suggestions.

I see that the C and Python APIs need a copy and ultimately, this is because rmm::device_buffer in this structure as well as the model is owned by the struct/class. Instead, if we use std::weak_ptr<rmm::device_buffer>> or plain rmm::device_buffer& then we can avoid copies and ownership issues by letting the C and Python API own the data but still allowing the C++ API to resize it.

The C API can also release the shared_ptr/device_buffer to transfer ownership to the user, thus removing the need for a copy there.

The Python API is owned by us and so, we can create shared_ptr/device_buffer in the Cython layer and store them as class members.

I don't quite understand how this can be accomplished with the given device_buffer implementation. Even if we pass the storages as unique pointers - we then can pass ownership of the whole device-buffer, but we will not be able to pass in/out actual memory allocations within the device_buffer.

For the python API we retrieve data as CumlArray and want to pass the ptr into the device_buffer. For the C-API we have fixed pointer style I/O where cannot simply switch to device_buffer*. IIUC we would need a class like the device_buffer but with additional capabilities to switch in between owning and not-owning.

@divyegala
Copy link
Member

@mfoerste4

Let me write an example for Python API and how to handle ownership, assuming the C++ API does not own the memory and instead holds onto references of the form rmm::device_buffer&. As for the C API, I think you are right and there's no way for us to release the ownership of memory, but at least we can avoid copies in the Python world:

Python API:
As mentioned in the original issue, instead of CumlArray we can use DeviceBuffer from RMM Python

  1. .pxd
  2. .pyx

As seen in the .pxd definition, the Python DeviceBuffer class holds an underlying reference to the C++ object device_buffer in the form of the property unique_ptr<device_buffer> c_obj. Thus, the code looks like:

class SVM:
def __init__(self):
  self.some_array = rmm.DeviceBuffer(...) # instead of CumlArray
  
def some_func(self):
  cpp_func(deref(some_array.c_obj.get()), ...)

@mfoerste4
Copy link
Contributor Author

@divyegala , thanks for your suggestions. I have refactored the SvmModel to work with device_buffer* in order to paste in the buffers after (de-)serialization without a copy.

Comment on lines +75 to +76
model.unique_labels->resize(model.n_classes * sizeof(math_t), stream);
raft::copy((math_t*)model.unique_labels->data(), unique_labels.data(), model.n_classes, stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference: https://github.com/rapidsai/rmm/blob/c494395e58288cac16321ce90e9b15f3508ae89a/include/rmm/device_uvector.hpp#L414

Suggested change
model.unique_labels->resize(model.n_classes * sizeof(math_t), stream);
raft::copy((math_t*)model.unique_labels->data(), unique_labels.data(), model.n_classes, stream);
model.unique_labels = unique_labels.release()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model.unique_labels is just a ptr to the buffer (e.g. owned by the python code), so nobody would be responsible to dealloc whatever we extract from the device_uvector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it

Comment on lines +199 to +200
*dual_coefs = (double*)rmm_alloc.allocate_async(
model.dual_coefs->size(), rmm::CUDA_ALLOCATION_ALIGNMENT, stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #6057 by @harrism , these raw calls to allocate need to go away. Should we change this API to assume that the user is passing in pre-allocated memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user does not know the size of the arrays in advance, so we would need to change the C-API to rely on resizable objects (just like the C++ API utilizes the device_buffer).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm that is a tricky situation. @harrism any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a device_buffer?

Copy link
Member

@harrism harrism Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is a C API?! My solution would be to not have a C API! (Why is a C API in a .cpp file?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C API source code is written in .cpp and we just export the API in .h header via extern C

int* indptr
int* indices
math_t* data
device_buffer* indptr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these raw pointers? Why doesn't the class own the buffers?

For that matter, why doesn't the class have a constructor and a destructor (especially if it contains raw pointers!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Tech Debt Issues related to debt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants