How to run quiver on server with complex GPU topology? #135

JIESUN233 · 2022-10-18T08:34:20Z

Hi, I want to run quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology is as follows:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0    X       NV12    PXB     PXB     0-25,52-77      0
GPU1    NV12    X       PXB     PXB     0-25,52-77      0
GPU2    PXB     PXB     X       NV12    0-25,52-77      0
GPU3    PXB     PXB     NV12    X       0-25,52-77      0
There are NVLinks between GPU 0,1 and GPU 2,3.

According to the documentation, this topology forms two cliques (GPUs 0,1 and GPUs 2,3), so the cache should be replicated across the two cliques. However, the cache appears to be distributed over all 4 GPUs.
Here is my code (dist_sampling_ogb_reddit_quiver.py, Reddit dataset, feature size 500 MB):
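To make the expected grouping concrete, here is a minimal sketch that derives the NVLink groups from the topology matrix above. It is not quiver's clique detection; the function name `nvlink_groups` is hypothetical, and it assumes that only entries starting with `NV` count as clique links. It computes connected components, which coincide with cliques for this topology.

```python
# Topology rows copied from the matrix above (link columns only).
TOPO_ROWS = """\
GPU0 X NV12 PXB PXB
GPU1 NV12 X PXB PXB
GPU2 PXB PXB X NV12
GPU3 PXB PXB NV12 X
"""


def nvlink_groups(rows):
    """Group GPU indices that are reachable from each other over NVLink.

    Assumption: a link label starting with "NV" means NVLink; anything
    else (PXB, X, ...) is not treated as a clique link.
    """
    # Drop the leading "GPUn" label so links[i][j] is the link from GPU i to GPU j.
    links = [line.split()[1:] for line in rows.strip().splitlines()]
    n = len(links)
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        group = []
        # Depth-first walk over NVLink edges only.
        while stack:
            gpu = stack.pop()
            group.append(gpu)
            for peer in range(n):
                if peer not in seen and links[gpu][peer].startswith("NV"):
                    seen.add(peer)
                    stack.append(peer)
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    print(nvlink_groups(TOPO_ROWS))
```

For the matrix above this yields the two groups I expected, GPUs 0,1 and GPUs 2,3.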
quiver.init_p2p(device_list=list(range(world_size)))
quiver_feature = quiver.Feature(rank=0, device_list=list(range(world_size)), device_cache_size="0.1G", cache_policy="p2p_clique_replicate", csr_topo=csr_topo)
This is what I got:
[0, 1, 2, 3]
LOG>>> P2P Access Initilization
Enable P2P Access Between 0 <---> 1
Enable P2P Access Between 0 <---> 2
Enable P2P Access Between 0 <---> 3
Enable P2P Access Between 1 <---> 2
Enable P2P Access Between 1 <---> 3
Enable P2P Access Between 2 <---> 3
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
LOG>>> 76% data cached
LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
LOG >>> Memory Budge On 0 is 102 MB
LOG >>> Memory Budge On 1 is 102 MB
LOG >>> Memory Budge On 2 is 102 MB
LOG >>> Memory Budge On 3 is 102 MB
Let's use 4 GPUs!
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
Epoch: 019, Epoch Time: 0.5197241902351379
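The "76% data cached" line is what makes me think the cache is spread over all 4 GPUs. A rough check, assuming the standard Reddit dataset shape (232,965 nodes, 602 float32 features, values not stated in this issue) and assuming each GPU in a clique holds a distinct 102 MB shard; `cached_fraction` is a hypothetical helper, not a quiver function:

```python
# Assumed Reddit feature shape (not taken from the logs above).
NUM_NODES = 232_965
FEATURE_DIM = 602
BYTES_PER_VALUE = 4  # float32

# Per-GPU budget reported in the log ("Memory Budge On N is 102 MB").
BUDGET_MIB = 102


def cached_fraction(gpus_per_clique):
    """Fraction of distinct feature rows cached when the GPUs of one clique
    each hold a different shard of size BUDGET_MIB."""
    feature_mib = NUM_NODES * FEATURE_DIM * BYTES_PER_VALUE / 2**20
    return min(1.0, gpus_per_clique * BUDGET_MIB / feature_mib)


if __name__ == "__main__":
    for gpus in (2, 4):
        print(f"{gpus} GPUs per clique: {cached_fraction(gpus):.0%} cached")
```

Under these assumptions, one clique of 4 GPUs gives about 76%, matching the log, while two cliques of 2 GPUs would give about 38%.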

So I wonder if there is a way to enable p2p_clique_replicate with two cliques on my 4-GPU server.
Thanks~
