-
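Sorry for the trouble. Out of habit, the script logs with wandb by default. If wandb is inconvenient for you, there are two ways to work around it:
The accelerator API documentation is here: If this helped, feel free to give RAG-Retrieval a star.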
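For reference, here is a minimal sketch of two common ways to avoid the wandb login prompt. It assumes the training script builds its `Accelerator` with a `log_with` argument, which this thread does not confirm. `choose_tracker` and `build_accelerator` are hypothetical helpers for illustration, not part of RAG-Retrieval or accelerate.

```python
import os

# Option 1: keep the wandb tracker but stop it from asking for an API key.
# WANDB_MODE=offline only writes runs to a local ./wandb directory;
# WANDB_MODE=disabled turns wandb calls into no-ops.
# This must be set before wandb.init() is reached, i.e. before init_trackers().
os.environ.setdefault("WANDB_MODE", "offline")


def choose_tracker():
    """Return "wandb" only when an API key is available, otherwise None.

    Hypothetical helper, not part of RAG-Retrieval or accelerate.
    """
    return "wandb" if os.environ.get("WANDB_API_KEY") else None


def build_accelerator():
    """Option 2: create the Accelerator with no tracker when no key is set."""
    # Imported lazily so this file can be read without accelerate installed.
    from accelerate import Accelerator

    # With log_with=None no tracker is registered, so init_trackers()
    # and log() become no-ops and wandb is never initialised.
    return Accelerator(log_with=choose_tracker())
```

The environment variable can also be set in the shell that runs `accelerate launch`, which avoids editing the script at all.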
-
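This looks like a bug inside accelerate that is triggered when the model is saved at the very end (the model has already been saved at the end of the epoch, though, so you can check for it). To avoid the bug, you can try downgrading accelerate to version 0.34.0.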
-
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: ERROR api_key not configured (no-tty). call wandb.login(key=[your_api_key])
Traceback (most recent call last):
File "train_embedding.py", line 190, in <module>
main()
File "train_embedding.py", line 85, in main
accelerator.init_trackers('embedding', config=vars(args))
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 734, in _inner
return PartialState().on_main_process(function)(*args, **kwargs)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 2701, in init_trackers
self.trackers.append(tracker_init(project_name, **init_kwargs.get(str(tracker), {})))
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/tracking.py", line 81, in execute_on_main_process
return function(self, *args, **kwargs)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/tracking.py", line 298, in __init__
self.run = wandb.init(project=self.run_name, **kwargs)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1270, in init
wandb._sentry.reraise(e)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/analytics/sentry.py", line 161, in reraise
raise exc.with_traceback(sys.exc_info()[2])
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1255, in init
wi.setup(kwargs)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 305, in setup
wandb_login._login(
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 347, in _login
wlogin.prompt_api_key()
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 281, in prompt_api_key
raise UsageError("api_key not configured (no-tty). call " + directive)
wandb.errors.errors.UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key])
[rank0]: Traceback (most recent call last):
[rank0]: File "train_embedding.py", line 190, in <module>
[rank0]: main()
[rank0]: File "train_embedding.py", line 85, in main
[rank0]: accelerator.init_trackers('embedding', config=vars(args))
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 734, in _inner
[rank0]: return PartialState().on_main_process(function)(*args, **kwargs)
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/accelerator.py", line 2701, in init_trackers
[rank0]: self.trackers.append(tracker_init(project_name, **init_kwargs.get(str(tracker), {})))
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/tracking.py", line 81, in execute_on_main_process
[rank0]: return function(self, *args, **kwargs)
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/tracking.py", line 298, in __init__
[rank0]: self.run = wandb.init(project=self.run_name, **kwargs)
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1270, in init
[rank0]: wandb._sentry.reraise(e)
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/analytics/sentry.py", line 161, in reraise
[rank0]: raise exc.with_traceback(sys.exc_info()[2])
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1255, in init
[rank0]: wi.setup(kwargs)
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 305, in setup
[rank0]: wandb_login._login(
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 347, in _login
[rank0]: wlogin.prompt_api_key()
[rank0]: File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 281, in prompt_api_key
[rank0]: raise UsageError("api_key not configured (no-tty). call " + directive)
[rank0]: wandb.errors.errors.UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key])
E1103 23:47:34.393793 139994886894208 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3625675) of binary: /data/liuxiang/anaconda3/envs/rag-retrieval/bin/python
Traceback (most recent call last):
File "/data/liuxiang/anaconda3/envs/rag-retrieval/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1155, in launch_command
multi_gpu_launcher(args)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/liuxiang/anaconda3/envs/rag-retrieval/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_embedding.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-11-03_23:47:34
host : tang3
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3625675)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
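The README never mentions that an API key has to be configured, and I don't know where to turn this off. I don't need any visualization either.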