Quiver multi-node ogbn-mag240m benchmark #134

prenc · 2022-09-21T20:43:44Z

I installed branch 0.2.0 within a conda env:

python 3.9
pytorch 1.12.1
cuda 10.2

As a sanity check, I can run examples/reddit-quiver.py and it works without any errors.
However, when I run benchmarks/ogbn-mag240m/train_quiver_multi_node.py, I run into several problems. So far I have been using a single node with a GPU, so I changed preprocessing.py accordingly (preprocess('data/mag', host=0, host_size=1, p2p_group=1, p2p_size=1)) and set cache_policy in the benchmark itself to the default one.

And this is what I get:

Warnings that the socket cannot be initialized. These should not matter, since I only want to use a single node?
Then some library problems: libibverbs: Could not locate libibgni
And the error itself: CUDA error: all CUDA-capable devices are busy or unavailable
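To narrow down the socket warnings, a small standard-library check can tell whether this host supports IPv6 sockets at all. The warnings report errno 97, which on Linux is EAFNOSUPPORT ("Address family not supported by protocol"). As far as I understand, c10d tries an IPv6 address first and then falls back to IPv4, so on a node without IPv6 these warnings are probably harmless, but that is an assumption and not something confirmed here. The helper name ipv6_usable is made up for this sketch.

```python
import socket


def ipv6_usable() -> bool:
    """Return True if an AF_INET6 TCP socket can be created on this host."""
    if not socket.has_ipv6:
        # Python itself was built without IPv6 support.
        return False
    try:
        # Creating the socket is enough; no bind or connect is attempted.
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM):
            return True
    except OSError:
        # On Linux, a kernel without IPv6 raises EAFNOSUPPORT (errno 97) here,
        # the same errno that appears in the c10d warnings.
        return False


if __name__ == "__main__":
    print("IPv6 sockets usable:", ipv6_usable())
```

If this prints False on the compute node, the socket warnings would be consistent with a missing IPv6 stack rather than with a multi-node setup problem.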

I am stuck on this. Does anybody know what might cause the error? Is it even possible to run the benchmark on a single node? If not, what prevents it?

The benchmark output:

Namespace(hidden_channels=1024, batch_size=1024, dropout=0.5, epochs=100, model='graphsage', sizes=[25, 15], in_memory=False, device='0', evaluate=False, host_size=1, local_size=1, host=0)
Global seed set to 42
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:19216 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [nid02085]:19216 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [nid02085]:19216 (errno: 97 - Address family not supported by protocol).
libibverbs: Could not locate libibgni (/usr/lib64/libibgni.so.1: undefined symbol: verbs_uninit_context)
libibverbs: Warning: couldn't open config directory '/opt/cray/rdma-core/27.1-7.0.3.1_4.6__g4beae6eb.ari/etc/libibverbs.d'.
MAG240: Reading the dataset... LOG >>> Memory Budge On 0 is 4095 MB
feat init 2.8276915550231934
Dataloader set up! [35.83s]
Let's use 1 GPUs!
0 beg
Traceback (most recent call last):
  File "/scratch/snx3000/prenc/torch-quiver/benchmarks/ogbn-mag240m/train_quiver_multi_node.py", line 426, in <module>
    mp.spawn(run,
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/scratch/snx3000/prenc/torch-quiver/benchmarks/ogbn-mag240m/train_quiver_multi_node.py", line 302, in run
    model = GNN(args.model,
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
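One possible cause worth ruling out (an assumption on my side, not confirmed): the GPU may be in an exclusive-process compute mode, where only one process can hold a CUDA context. The log lines before the spawn ("Memory Budge On 0", "feat init") suggest the parent process may already touch GPU 0, in which case the spawned worker could be refused. The sketch below asks nvidia-smi for each GPU's compute mode. The function names are hypothetical, and the sample string in the comment only illustrates the CSV shape.

```python
import shutil
import subprocess


def parse_compute_modes(csv_text: str) -> dict[int, str]:
    """Parse 'index, compute_mode' CSV lines into {gpu_index: mode}."""
    modes = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_compute_modes() -> dict[int, str]:
    """Ask nvidia-smi for the compute mode of every visible GPU.

    Returns an empty dict if nvidia-smi is missing or fails.
    """
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return {}
    return parse_compute_modes(result.stdout)


if __name__ == "__main__":
    # Illustrative input shape only: "0, Exclusive_Process"
    for gpu, mode in query_compute_modes().items():
        print(f"GPU {gpu}: compute mode = {mode}")
```

If the mode is anything other than the default one, it would be worth checking whether anything in the parent process initializes CUDA before mp.spawn is called.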
