You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi, I want to run quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology are as follows:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X NV12 PXB PXB 0-25,52-77 0
GPU1 NV12 X PXB PXB 0-25,52-77 0
GPU2 PXB PXB X NV12 0-25,52-77 0
GPU3 PXB PXB NV12 X 0-25,52-77 0
There are NVLinks between GPU 0,1 and GPU 2,3.
According to the documentation, there are two cliques(GPU 0,1 and GPU2,3). The cache should be replicate over two cliques. But I found the cache seems to distribute over 4GPUs.
Here is my code(dist_sampling_ogb_reddit_quiver.py, Reddit dataset, feature 500MB): quiver.init_p2p(device_list=list(range(world_size))) quiver_feature = quiver.Feature(rank=0, device_list=list(range(world_size)), device_cache_size="0.1G", cache_policy="p2p_clique_replicate", csr_topo=csr_topo)
Theses are what I got:
[0, 1, 2, 3]
LOG>>> P2P Access Initilization
Enable P2P Access Between 0 <---> 1
Enable P2P Access Between 0 <---> 2
Enable P2P Access Between 0 <---> 3
Enable P2P Access Between 1 <---> 2
Enable P2P Access Between 1 <---> 3
Enable P2P Access Between 2 <---> 3
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
LOG>>> 76% data cached
LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
LOG >>> Memory Budge On 0 is 102 MB
LOG >>> Memory Budge On 1 is 102 MB
LOG >>> Memory Budge On 2 is 102 MB
LOG >>> Memory Budge On 3 is 102 MB
Let's use 4 GPUs!
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
Epoch: 019, Epoch Time: 0.5197241902351379
So I wonder if there is a solution to enable p2p_clique_replicate on my 4 GPU server.
Thanks~
The text was updated successfully, but these errors were encountered:
Hi, I want to run quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology are as follows:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X NV12 PXB PXB 0-25,52-77 0
GPU1 NV12 X PXB PXB 0-25,52-77 0
GPU2 PXB PXB X NV12 0-25,52-77 0
GPU3 PXB PXB NV12 X 0-25,52-77 0
There are NVLinks between GPU 0,1 and GPU 2,3.
According to the documentation, there are two cliques(GPU 0,1 and GPU2,3). The cache should be replicate over two cliques. But I found the cache seems to distribute over 4GPUs.
Here is my code(dist_sampling_ogb_reddit_quiver.py, Reddit dataset, feature 500MB):
quiver.init_p2p(device_list=list(range(world_size)))
quiver_feature = quiver.Feature(rank=0, device_list=list(range(world_size)), device_cache_size="0.1G", cache_policy="p2p_clique_replicate", csr_topo=csr_topo)
Theses are what I got:
[0, 1, 2, 3]
LOG>>> P2P Access Initilization
Enable P2P Access Between 0 <---> 1
Enable P2P Access Between 0 <---> 2
Enable P2P Access Between 0 <---> 3
Enable P2P Access Between 1 <---> 2
Enable P2P Access Between 1 <---> 3
Enable P2P Access Between 2 <---> 3
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
LOG>>> 76% data cached
LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
LOG >>> Memory Budge On 0 is 102 MB
LOG >>> Memory Budge On 1 is 102 MB
LOG >>> Memory Budge On 2 is 102 MB
LOG >>> Memory Budge On 3 is 102 MB
Let's use 4 GPUs!
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
Epoch: 019, Epoch Time: 0.5197241902351379
So I wonder if there is a solution to enable p2p_clique_replicate on my 4 GPU server.
Thanks~
The text was updated successfully, but these errors were encountered: