Error during distributed training with train_rotate_FB15K237_dist.py from OpenKE 2.0 #410
The problem above is solved: it was caused by errors in my data. However, distributed training has now run into a new problem. Only one GPU is actually doing work, yet the other GPU is also shown as full.
warnings.warn(
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Input Files Path : ./benchmarks/data-390/
Here is the nvidia-smi output:
+-----------------------------------------------------------------------------------------+ |
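If "full" here refers to memory, note that memory allocated on a GPU does not by itself show that the GPU is computing. The sketch below is one possible way to compare utilization against memory use per GPU. It is not part of OpenKE; the helper names and the thresholds are hypothetical, and it assumes the fields come back as plain integers (some devices report `[N/A]`, which this parser does not handle).

```python
import subprocess

# Fields requested from nvidia-smi, in the order the parser expects.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    rows = []
    for line in text.strip().splitlines():
        # Each line looks like "0, 97, 23500, 24576" (values in % and MiB).
        index, util, used, total = (int(field.strip()) for field in line.split(","))
        rows.append(
            {"index": index, "util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return rows


def idle_but_allocated(rows, util_threshold=5, mem_fraction=0.5):
    """Return indices of GPUs that hold a lot of memory but show almost no compute.

    The thresholds are arbitrary illustration values, not recommendations.
    """
    return [
        r["index"]
        for r in rows
        if r["util_pct"] <= util_threshold
        and r["mem_used_mib"] >= mem_fraction * r["mem_total_mib"]
    ]


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `idle_but_allocated(query_gpus())` a few times while training runs would list any GPU whose memory is occupied while its utilization stays near zero. A single sample can be misleading, since utilization fluctuates between batches.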
Hello, I get the following error when running train_rotate_FB15K237_dist.py from OpenKE 2.0. Is there a way to fix it? I would really appreciate any help.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
The total of train triples is 2849846.
The total of train triples is 2849846.
Input Files Path : ./benchmarks/data-390/
Input Files Path : ./benchmarks/data-390/
The total of test triples is 258713.
The total of valid triples is 1293564.
The total of test triples is 258713.
The total of valid triples is 1293564.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 2646564) of binary: /home/jupyter-xingcheng/.conda/envs/openke/bin/python3.8
Traceback (most recent call last):
  File "/home/jupyter-xingcheng/.conda/envs/openke/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_rotate_data_390_dist.py FAILED
Failures:
[1]:
time : 2024-06-17_13:53:46
host : dell
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 2646565)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646565
Root Cause (first observed failure):
[0]:
time : 2024-06-17_13:53:46
host : dell
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 2646564)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646564
The command I ran was: WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port 1234 train_rotate_data_390_dist.py
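The report above only shows that both workers died with SIGSEGV (`error_file: <N/A>`), not where in the script they crashed. As a minimal debugging sketch, and not something taken from the OpenKE scripts, Python's standard `faulthandler` module can print the Python stack when a worker receives SIGSEGV. The function names below are hypothetical; `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are the environment variables that torchrun exports to each worker.

```python
import faulthandler
import os
import sys


def enable_crash_traces(stream=None):
    """Turn on faulthandler so a SIGSEGV dumps the Python stack of every thread."""
    # The target must be a real file with a file descriptor; default to stderr.
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


def describe_rank(env=None):
    """Read the rank variables that torchrun sets for each worker process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    # Placed at the very top of a training script, before the heavy imports,
    # this tags the log with the worker's rank and arms the crash handler.
    enable_crash_traces()
    print(describe_rank(), flush=True)
```

The same handler can be enabled without editing the script by prefixing the command with `PYTHONFAULTHANDLER=1`. If the crash happens inside compiled code, the dump still only shows the Python frame that made the call, so it narrows the location rather than explaining the cause.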