RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0. #189

HamzaBenHaj opened this issue May 27, 2024 · 0 comments
Hey everyone,

I am trying to get acquainted with UniAD and followed the instructions, but when I try to run the evaluation example:

```
./tools/uniad_dist_eval.sh ./projects/configs/stage1_track_map/base_track_map.py ./ckpts/uniad_base_track_map.pth 4
```

I receive the following error:

```
Traceback (most recent call last):
  File "./tools/test.py", line 261, in <module>
    main()
  File "./tools/test.py", line 227, in main
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801258 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801745 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801743 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 232910) of binary: /home/hammar/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

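In case it helps with triage, this is the kind of check I would add right before the `MMDistributedDataParallel(...)` wrap in `tools/test.py` to see which parameter the ranks disagree on (only a debugging sketch, not code from the repo; `model` is assumed to be the un-wrapped model the script builds, and the default process group is assumed to be initialized already):

```python
# Debug-only sketch (not part of UniAD): dump the first few parameter shapes
# on every rank, so the [200, 128] vs. process-0 mismatch can be spotted.
import torch.distributed as dist

rank = dist.get_rank()  # assumes dist.init_process_group() has already run
for i, (name, p) in enumerate(model.named_parameters()):
    # The failing parameter is replicas[0][0], i.e. the very first one,
    # so a handful of lines per rank is enough to compare.
    print(f"[rank {rank}] {name}: {tuple(p.shape)}", flush=True)
    if i >= 4:
        break
```
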
I then tried to run the training example (I can only use 4 GPUs):

```
./tools/uniad_dist_train.sh ./projects/configs/stage1_track_map/base_track_map.py 4
```

but I get the same error:

```
Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
Traceback (most recent call last):
  File "./tools/train.py", line 256, in <module>
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800494 milliseconds before timing out.
    main()
  File "./tools/train.py", line 245, in main
    custom_train_model(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    main()
  File "./tools/train.py", line 245, in main
    custom_train_detector(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
    custom_train_model(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    custom_train_detector(
  File "/home/hblab/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 70, in custom_train_detector
    model = MMDistributedDataParallel(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: replicas[0][0] in this process with sizes [200, 128] appears not to match sizes of the same param in process 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800524 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800471 milliseconds before timing out.

Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 346625) of binary: /home/hblab/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hblab/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

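For the training run, the same comparison could be done in one place by gathering every rank's parameter shapes onto rank 0 with `torch.distributed.all_gather_object` (again just a sketch under the same assumptions, not code from the repo):

```python
# Debug-only sketch (not part of UniAD): gather all parameter shapes from
# every rank and report the first parameter whose shape differs from rank 0.
import torch.distributed as dist

def report_shape_mismatch(model):
    shapes = {n: tuple(p.shape) for n, p in model.named_parameters()}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, shapes)  # available in torch >= 1.8
    if dist.get_rank() == 0:
        for name, shape in gathered[0].items():
            others = [g.get(name) for g in gathered[1:]]
            if any(o != shape for o in others):
                print(f"mismatch at {name}: rank0={shape}, other ranks={others}")
                return
        print("all parameter shapes match across ranks")
```
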
Has anyone encountered this before?

Thanks!
