How to run quiver on server with complex GPU topology? #135

Open
JIESUN233 opened this issue Oct 18, 2022 · 0 comments
JIESUN233 commented Oct 18, 2022

Hi, I want to run quiver's p2p_clique_replicate cache policy on a single server with 4 A100 GPUs. The GPU topology is as follows:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0    X       NV12    PXB     PXB     0-25,52-77      0
GPU1    NV12    X       PXB     PXB     0-25,52-77      0
GPU2    PXB     PXB     X       NV12    0-25,52-77      0
GPU3    PXB     PXB     NV12    X       0-25,52-77      0
NVLink connects GPU 0 with GPU 1 and GPU 2 with GPU 3; every other GPU pair is connected only through PCIe bridges (PXB).
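
To make the expected grouping concrete, here is a minimal sketch that derives NVLink groups from the link-type matrix above. It is not quiver code: `TOPOLOGY` and `nvlink_cliques` are hypothetical names introduced for illustration, and the function simply treats any entry starting with "NV" as an NVLink connection and collects connected GPUs into one group.

```python
# Illustrative sketch only; not part of quiver's API.
# Link types copied from the topology matrix above (row = GPU, column = peer).
TOPOLOGY = [
    ["X",    "NV12", "PXB",  "PXB"],
    ["NV12", "X",    "PXB",  "PXB"],
    ["PXB",  "PXB",  "X",    "NV12"],
    ["PXB",  "PXB",  "NV12", "X"],
]


def nvlink_cliques(topology):
    """Group GPU ids that are reachable from each other over NVLink links."""
    n = len(topology)
    seen = set()
    groups = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first walk over NVLink edges starting from this GPU.
        group, stack = [], [start]
        seen.add(start)
        while stack:
            gpu = stack.pop()
            group.append(gpu)
            for peer in range(n):
                if peer not in seen and topology[gpu][peer].startswith("NV"):
                    seen.add(peer)
                    stack.append(peer)
        groups.append(sorted(group))
    return groups


print(nvlink_cliques(TOPOLOGY))  # [[0, 1], [2, 3]]
```

With this matrix the sketch yields the two pairs described above, which is the grouping p2p_clique_replicate would be expected to use here.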

According to the documentation, this topology has two cliques (GPU 0,1 and GPU 2,3), so the cache should be replicated across the two cliques. However, the cache appears to be distributed over all 4 GPUs instead.
Here is my code (dist_sampling_ogb_reddit_quiver.py, Reddit dataset, features about 500 MB):
quiver.init_p2p(device_list=list(range(world_size)))
quiver_feature = quiver.Feature(
    rank=0,
    device_list=list(range(world_size)),
    device_cache_size="0.1G",
    cache_policy="p2p_clique_replicate",
    csr_topo=csr_topo,
)
This is what I got:
[0, 1, 2, 3]
LOG>>> P2P Access Initilization
Enable P2P Access Between 0 <---> 1
Enable P2P Access Between 0 <---> 2
Enable P2P Access Between 0 <---> 3
Enable P2P Access Between 1 <---> 2
Enable P2P Access Between 1 <---> 3
Enable P2P Access Between 2 <---> 3
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
LOG>>> 76% data cached
LOG>>> GPU [0, 1, 2, 3] belong to the same NUMA Domain
LOG >>> Memory Budge On 0 is 102 MB
LOG >>> Memory Budge On 1 is 102 MB
LOG >>> Memory Budge On 2 is 102 MB
LOG >>> Memory Budge On 3 is 102 MB
Let's use 4 GPUs!
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
WARNING: You are using p2p_clique_replicate mode, MAKE SURE you have called quiver.init_p2p() to enable p2p access
Epoch: 019, Epoch Time: 0.5197241902351379
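
As a rough check of the log above, the "76% data cached" line can be compared against the per-GPU budget. The sketch below assumes the Reddit feature table is 232,965 nodes × 602 float32 values (about 535 MB, in line with the roughly 500 MB mentioned above) and that the GPUs inside one clique hold disjoint parts of the cache. How quiver actually computes the percentage is not confirmed here; `cached_percent` is a hypothetical helper.

```python
# Back-of-the-envelope check; the dataset shape is an assumption, not taken from the log.
NUM_NODES = 232_965        # assumed Reddit node count
FEATURE_DIM = 602          # assumed feature dimension
BYTES_PER_VALUE = 4        # float32
BUDGET_MB_PER_GPU = 102    # "Memory Budge On N is 102 MB" in the log

feature_mb = NUM_NODES * FEATURE_DIM * BYTES_PER_VALUE / 2**20


def cached_percent(gpus_per_clique):
    """Percent of the feature table one clique holds if its GPUs store disjoint parts."""
    fraction = gpus_per_clique * BUDGET_MB_PER_GPU / feature_mb
    return int(100 * min(1.0, fraction))


print(cached_percent(4))  # 76: one clique spanning all 4 GPUs
print(cached_percent(2))  # 38: two cliques of 2 GPUs, each holding a replica
```

Under these assumptions, the logged 76% matches a single 4-GPU clique; two replicated 2-GPU cliques would be expected to report about 38%.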

So I wonder whether there is a way to enable p2p_clique_replicate on my 4-GPU server.
Thanks~
