support different entry size for different ranks #194
Conversation
Force-pushed from 64cfb03 to 35f192b.
Thanks for the great work. I've added some comments in the code.
@@ -478,15 +518,35 @@ def create_embedding_from_filelist(
        )
        total_file_size += file_size
    total_entry_count = total_file_size // file_entry_size
    if embedding_entry_partition is not None:
Maybe we can omit this check, because a similar check will always be done in create_embedding().
@@ -283,8 +311,27 @@ def create_wholememory_tensor_from_filelist(
    else:
        sizes = [total_entry_count, last_dim_size]
        strides = [last_dim_strides, 1]
    if tensor_entry_partition is not None:
Similarly, I think this could be omitted too, since we will do the check in create_wholememory_tensor().
cdef wholememory_error_code_t wholememory_tensor_get_entry_offsets(
    size_t * entry_offsets, wholememory_tensor_t wholememory_tensor);

cdef wholememory_error_code_t wholememory_tensor_get_entry_partition(
Perhaps this function should be named wholememory_tensor_get_entry_partition_sizes(), similar to the corresponding function for the wholememory embedding.
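A sketch of the suggested declaration at the C API level, mirroring the embedding-side naming (the parameter name here is illustrative, not the actual API):

// Suggested rename; the parameter name is illustrative.
wholememory_error_code_t wholememory_tensor_get_entry_partition_sizes(
  size_t* entry_partition_sizes, wholememory_tensor_t wholememory_tensor);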
if (mem_size_for_current_rank > 0) {
  void* ptr = nvshmem_ptr(nvshmem_memory_handle_.local_alloc_mem_ptr, i);
  if (ptr != nullptr) {
    register_wholememory_vma_range_locked(ptr, mem_size_for_current_rank, handle_);
  }
  ptr = nvshmem_ptr(nvshmem_memory_handle_.local_alloc_mem_ptr, 1);
Could you explain why we need this line (the second nvshmem_ptr() call with a hard-coded peer of 1)?
WHOLEMEMORY_RETURN_ON_FAIL(
  wholememory_get_rank_partition_offsets(host_embedding_entry_offsets_ptr, wholememory_handle));
for (int i = 0; i < world_size + 1; i++) {
Perhaps each process only needs to check its own local memory.
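A minimal sketch of that suggestion, assuming the enclosing function knows its own world_rank and that WHOLEMEMORY_INVALID_VALUE is the appropriate error code; only the boundaries of the local rank's partition are validated before converting all offsets to entry counts:

// Validate byte-offset alignment only for the partition owned by this rank;
// every other rank validates its own partition the same way.
if (host_embedding_entry_offsets_ptr[world_rank] % embedding_entry_size != 0 ||
    host_embedding_entry_offsets_ptr[world_rank + 1] % embedding_entry_size != 0) {
  return WHOLEMEMORY_INVALID_VALUE;
}
for (int i = 0; i < world_size + 1; i++) {
  host_embedding_entry_offsets_ptr[i] /= embedding_entry_size;
}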
  host_embedding_entry_offsets_ptr[i] /= embedding_entry_size;
}
WM_CUDA_CHECK(cudaMemcpy(dev_embedding_entry_offsets_ptr,
We need to issue this cudaMemcpy() on the stream argument instead of the default stream. There are many calls to this function.
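A sketch of the suggested change, assuming the enclosing function receives a cudaStream_t stream argument and that the offsets array has world_size + 1 entries:

// Issue the copy on the caller's stream instead of the default stream, so it
// is ordered with the rest of the work submitted on that stream.
WM_CUDA_CHECK(cudaMemcpyAsync(dev_embedding_entry_offsets_ptr,
                              host_embedding_entry_offsets_ptr,
                              (world_size + 1) * sizeof(size_t),
                              cudaMemcpyHostToDevice,
                              stream));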
size_t element_size = wholememory_dtype_get_element_size(wholememory_desc.dtype);
size_t embedding_entry_size = element_size * wholememory_desc.stride;
for (int i = 0; i < world_size + 1; i++) {
Again, checking only our own rank is probably enough.
size_t element_size = wholememory_dtype_get_element_size(wholememory_desc.dtype);
size_t embedding_entry_size = element_size * wholememory_desc.stride;
for (int i = 0; i < world_size + 1; i++) {
Only check one rank.
@@ -238,13 +238,16 @@ TEST_P(WholeMemoryEmbeddingParameterTests, EmbeddingGatherTest)
  wholememory_tensor_description_t embedding_tensor_description;
  wholememory_copy_matrix_desc_to_tensor(&embedding_tensor_description,
                                         &params.embedding_description);

  std::vector<size_t> rank_partition(world_size);
Perhaps we should also allow "random" and "default" partition modes here, just like in the pytest tests.
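A self-contained sketch of what that could look like in the C++ test; the helper name and the fixed seed are illustrative, not part of the library:

#include <random>
#include <vector>

// Build a per-rank entry partition: an even "default" split, or a randomly
// perturbed split when random_split is true. The fixed seed keeps every rank
// computing the same partition, which the test requires.
std::vector<size_t> make_rank_partition(size_t total_entry_count, int world_size, bool random_split)
{
  std::vector<size_t> partition(world_size, total_entry_count / world_size);
  partition[0] += total_entry_count % world_size;  // remainder goes to rank 0
  if (random_split) {
    std::mt19937 gen(42);
    for (int i = 0; i + 1 < world_size; i++) {
      // Move a random share of rank i's entries to rank i + 1.
      std::uniform_int_distribution<size_t> dist(0, partition[i] / 2);
      size_t moved = dist(gen);
      partition[i] -= moved;
      partition[i + 1] += moved;
    }
  }
  return partition;
}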
Force-pushed from 1b6dd93 to 83fd16c.
Seems good to me.
/okay to test
Force-pushed from 03c4bf3 to e550bc3.
/okay to test
Force-pushed from e550bc3 to 624c565.
/okay to test
Force-pushed from 624c565 to e661c4f.
/merge
/okay to test
Allow users to specify the entry size on each rank.