Bugs Fixing and Supporting for Multi-nodes #79

WangWenhao0716 · 2024-04-01T10:36:19Z

Thanks for this excellent work.

When I try to run it on multiple nodes with 16 GPUs, I get the following error:

Traceback (most recent call last):
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 271, in <module>
    main(args)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 208, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 130, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1103, in _run_ddp_forward
    inputs, kwargs = _to_kwargs(
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 100, in _to_kwargs
    _recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 92, in _recursive_to
    res = to_map(inputs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 83, in to_map
    return list(zip(*map(to_map, obj)))
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 65, in to_map
    stream = _get_stream(target_gpu)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 122, in _get_stream
    if _streams[device] is None:
IndexError: list index out of range

After investigating, I found that it comes from line 149 of train.py:

model = DDP(model.to(device), device_ids=[rank])

It should be:

model = DDP(model.to(device), device_ids=[device])

Then everything works normally.
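
For readers following the fix, here is a minimal sketch of why a global rank can fall outside the local GPU range. It assumes that `device` is the local GPU index derived as the global rank modulo the number of GPUs per node, and that the 16 GPUs are laid out as 2 nodes of 8 GPUs each; the post only says "16 GPUs", so that layout and the helper name `local_device_index` are hypothetical.

```python
def local_device_index(global_rank: int, gpus_per_node: int) -> int:
    """Map a global rank to a GPU index that exists on the local node.

    Assumes ranks are assigned contiguously per node (ranks 0..7 on the
    first node, 8..15 on the second, and so on).
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    return global_rank % gpus_per_node


# Hypothetical layout: 2 nodes x 8 GPUs = 16 processes.
GPUS_PER_NODE = 8

for rank in (0, 7, 8, 15):
    device = local_device_index(rank, GPUS_PER_NODE)
    # A global rank is only usable as a device id while it is smaller
    # than the number of GPUs visible on the node.
    rank_is_valid_device_id = rank < GPUS_PER_NODE
    print(f"rank={rank:2d} device={device} "
          f"rank_is_valid_device_id={rank_is_valid_device_id}")
```

On a single node the two values coincide, which would explain why `device_ids=[rank]` only fails once a second node is added. When launching with torchrun, the `LOCAL_RANK` environment variable is another way to obtain the local index without relying on contiguous rank assignment.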

hustzyj · 2024-06-09T05:13:30Z

Hi, I have some problems when training on multiple nodes.

WangWenhao0716 · 2024-06-09T05:15:17Z

There is too little information to tell what happened.

hustzyj · 2024-06-09T05:30:42Z

Hi, when I train on multiple nodes, the code gets stuck at dist.init_process_group('nccl') and never proceeds to training. The exact message is in the image above. I don't have this problem when I use a single node. My multi-node command is:

srun -N 4 --gres=gpu:v100:2 -p gpu python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 train.py --model DiT-XL/2 --data-path ./ImageNet/train

I'm not sure what the problem is. Could the command be missing some information that multi-node training needs?
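
One thing worth checking, offered as a guess rather than a confirmed diagnosis: with the default `env://` initialization, `dist.init_process_group` reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment and blocks until `WORLD_SIZE` processes have joined the same master. The command above passes no `--node_rank`, `--master_addr`, or `--master_port` to the launcher, so the nodes may not be pointing at a common master. The sketch below is a small pre-flight check that could be called before `init_process_group`; the helper name `missing_rendezvous_vars` is hypothetical.

```python
import os
from typing import List, Mapping

# Variables read by PyTorch's default env:// rendezvous.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_rendezvous_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report what this process actually sees before it tries to rendezvous.
missing = missing_rendezvous_vars(os.environ)
if missing:
    print("missing rendezvous variables:", ", ".join(missing))
else:
    print("rendezvous target:",
          os.environ["MASTER_ADDR"] + ":" + os.environ["MASTER_PORT"],
          "rank", os.environ["RANK"], "of", os.environ["WORLD_SIZE"])
```

If every node prints a loopback address as its rendezvous target, each node is waiting on its own local master and the group can never reach the full world size, which would look like a hang.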

WangWenhao0716 · 2024-06-09T05:35:15Z

I'm not sure whether this is caused by srun or by PyTorch. Please try training without Slurm first.

hustzyj · 2024-06-09T05:44:02Z

May I ask what command you use for multi-node training?
