
Bug Fixes and Multi-node Support #79

Open
WangWenhao0716 opened this issue Apr 1, 2024 · 5 comments

Comments

@WangWenhao0716

Thanks for this excellent work.

When I try to run it on multiple nodes with 16 GPUs, I get the following error:

Traceback (most recent call last):
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 271, in <module>
    main(args)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/train.py", line 208, in main
    loss_dict = diffusion.training_losses(model, x, t, model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 97, in training_losses
    return super().training_losses(self._wrap_model(model), *args, **kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/gaussian_diffusion.py", line 747, in training_losses
    model_output = model(x_t, t, **model_kwargs)
  File "/gs/home/statchao/zdhu/code/DiT_wenhao/DiT/diffusion/respace.py", line 130, in __call__
    return self.model(x, new_ts, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1103, in _run_ddp_forward
    inputs, kwargs = _to_kwargs(
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 100, in _to_kwargs
    _recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 92, in _recursive_to
    res = to_map(inputs)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 83, in to_map
    return list(zip(*map(to_map, obj)))
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/distributed/utils.py", line 65, in to_map
    stream = _get_stream(target_gpu)
  File "/gs/home/statchao/anaconda3/envs/torch2/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 122, in_get_stream
    if _streams[device] is None:
IndexError: list index out of range

After investigating, I found that it comes from line 149 of train.py:

model = DDP(model.to(device), device_ids=[rank])

It should be:

model = DDP(model.to(device), device_ids=[device])

Then everything works normally.
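
For context, here is a minimal sketch of the surrounding setup (assuming train.py follows the repo's usual pattern of deriving the local device from the global rank; build_model() is a placeholder for the actual model construction):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()                     # global rank, e.g. 0..15 across two 8-GPU nodes
device = rank % torch.cuda.device_count()  # local GPU index on this node, e.g. 0..7
torch.cuda.set_device(device)

model = build_model()  # placeholder for the actual DiT model construction

# Buggy: device_ids=[rank] passes a global rank, which exceeds the number of GPUs
# visible on any node other than the first (e.g. ranks 8..15) and triggers the
# IndexError in the traceback above.
# model = DDP(model.to(device), device_ids=[rank])

# Fixed: pass the local device index instead.
model = DDP(model.to(device), device_ids=[device])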

@hustzyj

hustzyj commented Jun 9, 2024

Hi, I have some problems when I train on multiple nodes:
[Screenshot 2024-06-09 131303]

@WangWenhao0716
Author

There is too little information to tell what happened.

@hustzyj

hustzyj commented Jun 9, 2024

Hi, when I train with multiple nodes, the code gets stuck at dist.init_process_group('nccl') and never proceeds to training; the exact message is in the screenshot above. I don't have this problem when I use a single node. My multi-node command is:

srun -N 4 --gres=gpu:v100:2 -p gpu python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 train.py --model DiT-XL/2 --data-path ./ImageNet/train

I'm not sure what the problem is. Is the command missing something that causes the multi-node training to fail?

@WangWenhao0716
Author


Not sure whether it is caused by srun or by PyTorch. Please try training without Slurm first.
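
For example, you could start the launcher by hand on every node with an explicit node rank and a shared rendezvous address (a rough sketch, not tested on your cluster; MASTER_IP and the port are placeholders for node 0's reachable address):

# On node 0 (replace MASTER_IP with node 0's IP):
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=0 --master_addr=MASTER_IP --master_port=29500 train.py --model DiT-XL/2 --data-path ./ImageNet/train

# On node k (k = 1, 2, 3), change only --node_rank:
python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=k --master_addr=MASTER_IP --master_port=29500 train.py --model DiT-XL/2 --data-path ./ImageNet/train

If --node_rank and --master_addr are not set, every node defaults to node rank 0 with a localhost master, so each node's processes wait for a world size of 8 that never assembles, which would explain a hang inside dist.init_process_group('nccl').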

@hustzyj

hustzyj commented Jun 9, 2024

May I ask what command you use for multi-node training?
