RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0. #189

HamzaBenHaj opened this issue May 27, 2024 · 0 comments
Hey everyone,

I am trying to get acquainted with UniAD and followed the instructions, but when I try to run the evaluation example:

```
./tools/uniad_dist_eval.sh ./projects/configs/stage1_track_map/base_track_map.py ./ckpts/uniad_base_track_map.pth 4
```

I receive the following error:

```
Traceback (most recent call last):
  File "./tools/test.py", line 261, in <module>
    main()
  File "./tools/test.py", line 227, in main
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 232910) of binary: /home/hammar/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

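In case it helps with triage, this is the kind of check I would add right before the `MMDistributedDataParallel(...)` wrap in `tools/test.py` to see which parameter the ranks disagree on (only a debugging sketch, not code from the repo; `model` is assumed to be the un-wrapped model the script builds, and the default process group is assumed to be initialized already):

```python
# Debug-only sketch (not part of UniAD): dump the first few parameter shapes
# on every rank, so the [200, 128] vs. process-0 mismatch can be spotted.
import torch.distributed as dist

rank = dist.get_rank()  # assumes dist.init_process_group() has already run
for i, (name, p) in enumerate(model.named_parameters()):
    # The failing parameter is replicas[0][0], i.e. the very first one,
    # so a handful of lines per rank is enough to compare.
    print(f"[rank {rank}] {name}: {tuple(p.shape)}", flush=True)
    if i >= 4:
        break
```
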
I then tried to run the training example (I can only use 4 GPUs):

```
./tools/uniad_dist_train.sh ./projects/configs/stage1_track_map/base_track_map.py 4
```

but I get the same error:

```
Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800494 milliseconds before timing out.
    main()
  File "./tools/train.py", line 245, in main
    custom_train_model(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    main()
  File "./tools/train.py", line 245, in main
    custom_train_detector(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
    custom_train_model(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    custom_train_detector(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800524 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800471 milliseconds before timing out.

Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 346625) of binary: /home/hblab/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

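For the training run, the same comparison could be done in one place by gathering every rank's parameter shapes onto rank 0 with `torch.distributed.all_gather_object` (again just a sketch under the same assumptions, not code from the repo):

```python
# Debug-only sketch (not part of UniAD): gather all parameter shapes from
# every rank and report the first parameter whose shape differs from rank 0.
import torch.distributed as dist

def report_shape_mismatch(model):
    shapes = {n: tuple(p.shape) for n, p in model.named_parameters()}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, shapes)  # available in torch >= 1.8
    if dist.get_rank() == 0:
        for name, shape in gathered[0].items():
            others = [g.get(name) for g in gathered[1:]]
            if any(o != shape for o in others):
                print(f"mismatch at {name}: rank0={shape}, other ranks={others}")
                return
        print("all parameter shapes match across ranks")
```
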
Has anyone encountered this before?

Thanks!
