I installed branch 0.2.0 within a conda env (Python 3.9, PyTorch 1.12.1, CUDA 10.2). As a sanity check, I can run examples/reddit-quiver.py and it works without any errors.
However, when I try to run benchmarks/ogbn-mag240m/train_quiver_multi_node.py, I run into a number of problems. So far I have been using a single node with one GPU, so I changed preprocessing.py accordingly (preprocess('data/mag', host=0, host_size=1, p2p_group=1, p2p_size=1)) and set cache_policy in the benchmark itself back to the default value.
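For clarity, this is the single-node call I use in preprocessing.py (everything collapsed to one host and one GPU); the cache_policy change in the benchmark is simply switching it back to its default value, so I have not shown it here:

```python
# benchmarks/ogbn-mag240m/preprocessing.py, with my single-node settings:
# host 0 out of host_size=1 hosts, and p2p_group / p2p_size both set to 1
# since there is only a single GPU.
preprocess('data/mag', host=0, host_size=1, p2p_group=1, p2p_size=1)
```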
And this is what I get:

- warnings that the server/client sockets cannot be initialized, which I assume should not matter since I only want to use a single node,
- then some library problems: libibverbs: Could not locate libibgni,
- and the error itself: CUDA error: all CUDA-capable devices are busy or unavailable.

I am stuck on this. Does anybody know what might be causing this error? Is it even possible to run the benchmark on a single node? If not, what prevents it?
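In case it helps with diagnosing, here is a minimal, Quiver-free check of the CUDA setup that I can run on the node (plain PyTorch only, nothing from the benchmark):

```python
import torch

# What the installed wheel was built against, and whether the driver exposes a GPU.
print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())

# The same kind of operation the benchmark fails on, but in a single plain process:
# allocate a tensor directly on GPU 0.
x = torch.zeros(1, device="cuda:0")
print("allocated on", x.device)
```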
The benchmark output:
Namespace(hidden_channels=1024, batch_size=1024, dropout=0.5, epochs=100, model='graphsage', sizes=[25, 15], in_memory=False, device='0', evaluate=False, host_size=1, local_size=1, host=0)
Global seed set to 42
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:19216 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [nid02085]:19216 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [nid02085]:19216 (errno: 97 - Address family not supported by protocol).
libibverbs: Could not locate libibgni (/usr/lib64/libibgni.so.1: undefined symbol: verbs_uninit_context)
libibverbs: Warning: couldn't open config directory '/opt/cray/rdma-core/27.1-7.0.3.1_4.6__g4beae6eb.ari/etc/libibverbs.d'.
MAG240: Reading the dataset... LOG >>> Memory Budge On 0 is 4095 MB
feat init 2.8276915550231934
Dataloader set up! [35.83s]
Let's use 1 GPUs!
0 beg
Traceback (most recent call last):
File "/scratch/snx3000/prenc/torch-quiver/benchmarks/ogbn-mag240m/train_quiver_multi_node.py", line 426, in <module>
mp.spawn(run,
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/scratch/snx3000/prenc/torch-quiver/benchmarks/ogbn-mag240m/train_quiver_multi_node.py", line 302, in run
model = GNN(args.model,
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/scratch/snx3000/prenc/torch-quiver/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
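Since the traceback dies inside mp.spawn at the point where the model parameters are moved to cuda:0, this is a stripped-down sketch of just that step (one spawned worker doing a .to() on GPU 0); it is a hypothetical repro outside Quiver, not code taken from the benchmark:

```python
import torch
import torch.multiprocessing as mp

def run(rank):
    # Mirrors the failing step in the benchmark: moving tensors/parameters
    # to cuda:0 from a worker started with mp.spawn.
    x = torch.zeros(1).to(torch.device("cuda", rank))
    print(f"rank {rank}: moved tensor to {x.device}")

if __name__ == "__main__":
    # nprocs=1 to match the single-GPU, single-node setup.
    mp.spawn(run, nprocs=1, join=True)
```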