Quiver multi-node ogbn-mag240m benchmark #134

Open
prenc opened this issue Sep 21, 2022 · 0 comments

I installed branch 0.2.0 within a conda env:

  • python 3.9
  • pytorch 1.12.1
  • cuda 10.2

As a sanity check, I ran examples/reddit-quiver.py, and it works without any errors.
However, when I run benchmarks/ogbn-mag240m/train_quiver_multi_node.py, I run into several problems. So far I have been using a single node with a GPU, so I changed preprocessing.py accordingly (preprocess('data/mag', host=0, host_size=1, p2p_group=1, p2p_size=1)) and set cache_policy in the benchmark itself to the default one.

And this is what I get:

  • warnings that the socket cannot be initialized (these should not matter, since I only want to use a single node?)
  • then some library problems: libibverbs: Could not locate libibgni
  • and the error itself: CUDA error: all CUDA-capable devices are busy or unavailable
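
The socket warnings report errno 97, which on Linux is EAFNOSUPPORT ("Address family not supported by protocol"), and the address they mention, [::], is an IPv6 wildcard. A minimal sketch for checking whether the node can create IPv6 sockets at all, assuming a Linux host; the helper name ipv6_supported is hypothetical and not part of Quiver or PyTorch. My understanding, which I have not verified for this setup, is that c10d retries over IPv4 after such a warning, so a False result here would make the warnings look harmless rather than explain the CUDA error.

```python
import errno
import socket


def ipv6_supported() -> bool:
    """Return True if this host can create an IPv6 TCP socket.

    Hypothetical helper for illustration; not part of Quiver or PyTorch.
    """
    if not socket.has_ipv6:
        # Python itself was built without IPv6 support.
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM):
            return True
    except OSError:
        # On Linux this is typically errno.EAFNOSUPPORT (97),
        # the same code that appears in the c10d warnings.
        return False


if __name__ == "__main__":
    print("EAFNOSUPPORT =", errno.EAFNOSUPPORT)
    print("IPv6 sockets available:", ipv6_supported())
```

Running this on the compute node (not the login node) gives the relevant answer, since the two may be configured differently.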

I am stuck on this. Does anybody know what might cause the error? Is it even possible to run the benchmark on a single node? If not, what prevents it?

The benchmark output:

Namespace(hidden_channels=1024, batch_size=1024, dropout=0.5, epochs=100, model='graphsage', sizes=[25, 15], in_memory=False, device='0', evaluate=False, host_size=1, local_size=1, host=0)
Global seed set to 42
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:19216 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [nid02085]:19216 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [nid02085]:19216 (errno: 97 - Address family not supported by protocol).
libibverbs: Could not locate libibgni (/usr/lib64/libibgni.so.1: undefined symbol: verbs_uninit_context)
libibverbs: Warning: couldn't open config directory '/opt/cray/rdma-core/27.1-7.0.3.1_4.6__g4beae6eb.ari/etc/libibverbs.d'.
MAG240: Reading the dataset... LOG >>> Memory Budge On 0 is 4095 MB
feat init 2.8276915550231934
Dataloader set up! [35.83s]
Let's use 1 GPUs!
0 beg
Traceback (most recent call last):
  File "/scratch/snx3000/prenc/torch-quiver/benchmarks/ogbn-mag240m/train_quiver_multi_node.py", line 426, in <module>
    mp.spawn(run,
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/snx3000/prenc/torch-quiver/benchmarks/ogbn-mag240m/train_quiver_multi_node.py", line 302, in run
    model = GNN(args.model,
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
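
One possible cause worth ruling out is a GPU set to an exclusive compute mode, in which only one process may hold a CUDA context on the device. The traceback fails at the first .to(device) inside the process started by mp.spawn, and the earlier "Memory Budge On 0" line suggests that the parent process may already be using GPU 0; that reading of the log is an assumption, not something confirmed here. A minimal sketch for inspecting the compute mode, assuming nvidia-smi is on PATH on the compute node; the function names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def parse_compute_modes(csv_text: str) -> List[str]:
    """Split nvidia-smi CSV output (one GPU per line) into a list of modes."""
    return [line.strip() for line in csv_text.splitlines() if line.strip()]


def exclusive_gpus(modes: List[str]) -> List[int]:
    """Return indices of GPUs whose mode name contains 'exclusive'."""
    return [i for i, mode in enumerate(modes) if "exclusive" in mode.lower()]


def query_compute_modes() -> List[str]:
    """Ask nvidia-smi for each GPU's compute mode; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    modes = query_compute_modes()
    print("Compute modes:", modes)
    print("GPUs in an exclusive mode:", exclusive_gpus(modes))
```

If a GPU shows up in an exclusive mode, a parent process and a spawned child cannot both open it unless the cluster provides a way to share the device; the site documentation would be the place to check for that.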