[BUG] DPA-2 Training is super slow because of “DEEPMD WARNING Data loading buffer is empty or nearly empty“ #3951

Open
Manyi-Yang opened this issue Jul 5, 2024 · 2 comments

Comments

@Manyi-Yang

Bug summary

When running DPA-2 training (dp_tf) with DeePMD-kit v3.0.0a0, we always get the following warning:

[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s

This suggests that data loading takes up a large share of the training time, which makes my training very slow.
Is there a way to fix this problem?
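
To put a number on "very slow", the log excerpt can be turned into a per-batch cost. The following is a minimal sketch that parses the "total wall time" lines quoted in this issue; the regular expression and the helper names `parse` and `seconds_per_batch` are illustrative and not part of DeePMD-kit. It assumes the reported wall time covers the interval since the previous report, which the timestamps in the log appear to confirm.

```python
import re
from datetime import datetime

# Log lines copied from this issue (only the wall-time reports are needed).
LOG = """\
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
"""

# Illustrative pattern for the report lines shown above.
PATTERN = re.compile(
    r"\[(?P<ts>[^\]]+)\] DEEPMD INFO batch (?P<batch>\d+): "
    r"total wall time = (?P<wall>[\d.]+) s"
)


def parse(log):
    """Return (timestamp, batch index, reported wall time) per report line."""
    rows = []
    for line in log.splitlines():
        m = PATTERN.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
            rows.append((ts, int(m["batch"]), float(m["wall"])))
    return rows


def seconds_per_batch(rows):
    """Compare timestamp-derived and reported seconds per batch."""
    out = []
    for (t0, b0, _), (t1, b1, wall) in zip(rows, rows[1:]):
        n_batches = b1 - b0
        elapsed = (t1 - t0).total_seconds()
        out.append((b1, elapsed / n_batches, wall / n_batches))
    return out


if __name__ == "__main__":
    for batch, from_ts, from_wall in seconds_per_batch(parse(LOG)):
        print(f"batch {batch}: {from_ts:.3f} s/batch (timestamps), "
              f"{from_wall:.3f} s/batch (reported)")
```

For the lines above this works out to roughly 0.43 to 0.47 s per batch. Comparing that figure before and after a change to the launch command or the data-loading settings gives a direct measure of whether the bottleneck has moved.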

By the way, the keyword "stat_file": "./dpa2 was used.

DeePMD-kit Version

V3.0.0a0

Backend and its version

deepmd-kit.3.0_cuda123/lib/python3.11

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:19:32,806] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:30:11,391] DEEPMD INFO batch 24000: trn: rmse = 1.21e+01, rmse_e = 1.30e-01, rmse_f = 4.21e-01, lr = 1.65e-04
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: val: rmse = 9.12e+00, rmse_e = 1.42e-01, rmse_f = 3.18e-01
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
{
  "_comment": "that's all",
  "model": {
    "type_map": ["H", "C", "N", "O"],
    "descriptor": {
      "type": "dpa2",
      "tebd_dim": 8,
      "repinit_rcut": 7.0,
      "repinit_rcut_smth": 6.0,
      "repinit_nsel": 100,
      "repformer_rcut": 4.0,
      "repformer_rcut_smth": 3.5,
      "repformer_nsel": 40,
      "repinit_neuron": [25, 50, 100],
      "repinit_axis_neuron": 12,
      "repinit_activation": "tanh",
      "repformer_nlayers": 12,
      "repformer_g1_dim": 128,
      "repformer_g2_dim": 32,
      "repformer_attn2_hidden": 32,
      "repformer_attn2_nhead": 4,
      "repformer_attn1_hidden": 128,
      "repformer_attn1_nhead": 4,
      "repformer_axis_dim": 4,
      "repformer_update_h2": false,
      "repformer_update_g1_has_conv": true,
      "repformer_update_g1_has_grrg": true,
      "repformer_update_g1_has_drrd": true,
      "repformer_update_g1_has_attn": true,
      "repformer_update_g2_has_g1g1": true,
      "repformer_update_g2_has_attn": true,
      "repformer_attn2_has_gate": true,
      "repformer_add_type_ebd_to_seq": false
    },

Steps to Reproduce

mpirun -np 4 dp --pt train --skip-neighbor-stat --mpi-log=master input.json

Further Information, Files, and Links

No response

@Manyi-Yang Manyi-Yang added the bug label Jul 5, 2024
@njzjz
Member

njzjz commented Jul 5, 2024

I don't think MPI training has been supported by the PyTorch backend. Please read the documentation.

@njzjz
Member

njzjz commented Aug 21, 2024

> I don't think MPI training has been supported by the PyTorch backend. Please read the documentation.

To clarify, MPI is supported by PyTorch DDP, but PyTorch needs to be compiled with MPI support. Also, the nccl backend is currently hard-coded here:

dist.init_process_group(backend="nccl")
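
As a rough illustration of what lifting the hard-coded backend could look like, the sketch below chooses the backend name before calling `dist.init_process_group`. It is not the DeePMD-kit implementation: the helper names `pick_backend` and `init_distributed` and the environment variable `EXAMPLE_DIST_BACKEND` are hypothetical. Only `torch.cuda.is_available()`, `torch.distributed.is_mpi_available()`, and `torch.distributed.init_process_group(backend=...)` are existing PyTorch calls.

```python
import os

# Backend names accepted by torch.distributed.
VALID_BACKENDS = ("nccl", "gloo", "mpi")


def pick_backend(requested=None, cuda_available=False, mpi_available=False):
    """Return the backend name to pass to dist.init_process_group.

    `requested` is an optional user override (hypothetical); without it,
    fall back to nccl on GPU machines and gloo otherwise.
    """
    if requested:
        requested = requested.lower()
        if requested not in VALID_BACKENDS:
            raise ValueError(f"unknown backend: {requested!r}")
        if requested == "mpi" and not mpi_available:
            # The mpi backend only exists if PyTorch was built with MPI.
            raise RuntimeError("this PyTorch build has no MPI support")
        return requested
    return "nccl" if cuda_available else "gloo"


def init_distributed():
    """Initialise the process group with the selected backend."""
    # Imported lazily so pick_backend can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    backend = pick_backend(
        os.environ.get("EXAMPLE_DIST_BACKEND"),  # hypothetical variable name
        cuda_available=torch.cuda.is_available(),
        mpi_available=dist.is_mpi_available(),
    )
    dist.init_process_group(backend=backend)
    return backend
```

Keeping the selection logic in a plain function separates the policy (which backend to use) from the PyTorch call itself, so the fallback rules can be checked without launching a distributed job.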

@njzjz njzjz added enhancement and removed Docs labels Aug 21, 2024