[BUG] DPA-2 Training is super slow because of "DEEPMD WARNING Data loading buffer is empty or nearly empty" #3951

Manyi-Yang · 2024-07-05T09:51:04Z

Bug summary

When running a DPA-2 (dp_tf) training calculation with DeePMD-kit V3.0.0a0, we always get the following warning:

[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s

This means that loading the data takes up a lot of time during training, which makes my training very slow.
I would like to know whether there is a way to fix this problem.
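
As a rough way to quantify the slowdown from the log above, here is a minimal sketch that divides each reported wall time by the number of batches since the previous report. It assumes that "total wall time" covers only the interval since the last report, which is consistent with the timestamps (about 948 s between batch 20000 and batch 22000). The function name seconds_per_batch is made up for this example and is not part of DeePMD-kit.

```python
import re

# Wall-time lines copied from the training log in this report.
LOG = """\
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
"""

# Captures the batch index and the reported wall time in seconds.
WALL_TIME = re.compile(r"batch\s+(\d+): total wall time = ([\d.]+) s")


def seconds_per_batch(log_text):
    """Map each reported batch index to the average seconds per batch.

    Assumption: the reported wall time is the time spent since the
    previous report, so the first report has no interval and is skipped.
    """
    rows = [(int(b), float(s)) for b, s in WALL_TIME.findall(log_text)]
    result = {}
    for (prev_batch, _), (batch, seconds) in zip(rows, rows[1:]):
        result[batch] = seconds / (batch - prev_batch)
    return result


if __name__ == "__main__":
    for batch, spb in seconds_per_batch(LOG).items():
        print(f"up to batch {batch}: {spb:.3f} s/batch")
```

For the lines above this gives roughly 0.47 s per batch up to batch 22000 and 0.43 s per batch up to batch 24000. Comparing these numbers before and after a change to the data loading setup shows whether the change actually helped.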

By the way, the keyword "stat_file": "./dpa2 was used.

DeePMD-kit Version

V3.0.0a0

Backend and its version

deepmd-kit.3.0_cuda123/lib/python3.11

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:19:32,806] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:30:11,391] DEEPMD INFO batch 24000: trn: rmse = 1.21e+01, rmse_e = 1.30e-01, rmse_f = 4.21e-01, lr = 1.65e-04
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: val: rmse = 9.12e+00, rmse_e = 1.42e-01, rmse_f = 3.18e-01
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
{
  "_comment": "that's all",
  "model": {
    "type_map": ["H", "C", "N", "O"],
    "descriptor": {
      "type": "dpa2",
      "tebd_dim": 8,
      "repinit_rcut": 7.0,
      "repinit_rcut_smth": 6.0,
      "repinit_nsel": 100,
      "repformer_rcut": 4.0,
      "repformer_rcut_smth": 3.5,
      "repformer_nsel": 40,
      "repinit_neuron": [25, 50, 100],
      "repinit_axis_neuron": 12,
      "repinit_activation": "tanh",
      "repformer_nlayers": 12,
      "repformer_g1_dim": 128,
      "repformer_g2_dim": 32,
      "repformer_attn2_hidden": 32,
      "repformer_attn2_nhead": 4,
      "repformer_attn1_hidden": 128,
      "repformer_attn1_nhead": 4,
      "repformer_axis_dim": 4,
      "repformer_update_h2": false,
      "repformer_update_g1_has_conv": true,
      "repformer_update_g1_has_grrg": true,
      "repformer_update_g1_has_drrd": true,
      "repformer_update_g1_has_attn": true,
      "repformer_update_g2_has_g1g1": true,
      "repformer_update_g2_has_attn": true,
      "repformer_attn2_has_gate": true,
      "repformer_add_type_ebd_to_seq": false
    },

Steps to Reproduce

mpirun -np 4 dp --pt train --skip-neighbor-stat --mpi-log=master input.json

Further Information, Files, and Links

No response


njzjz · 2024-07-05T23:35:44Z

I don't think MPI training has been supported by the PyTorch backend. Please read the documentation.

njzjz · 2024-08-21T20:29:05Z

"I don't think MPI training has been supported by the PyTorch backend. Please read the documentation."

To clarify, MPI is supported by PyTorch DDP, but only when PyTorch itself is compiled with MPI. Also, the nccl backend is hard-coded here:

deepmd-kit/deepmd/pt/entrypoints/main.py

Line 110 in 63e4a25

dist.init_process_group(backend="nccl")
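
To illustrate what not hard-coding the backend could look like, here is a minimal sketch. It is not the DeePMD-kit implementation: the environment variable DP_DIST_BACKEND, the preference order, and the helper names choose_backend and init_distributed are all assumptions made for this example. Only the torch.distributed calls (is_nccl_available, is_mpi_available, is_gloo_available, init_process_group) are real PyTorch APIs.

```python
import os

# Fallback order when no backend is requested (an assumption of this sketch).
PREFERENCE = ("nccl", "mpi", "gloo")


def choose_backend(requested, available):
    """Pick a backend name.

    requested: backend name or None.
    available: dict mapping backend name to True/False.
    """
    if requested:
        if not available.get(requested, False):
            raise RuntimeError(
                f"backend {requested!r} is not available in this PyTorch build"
            )
        return requested
    for name in PREFERENCE:
        if available.get(name, False):
            return name
    raise RuntimeError("no distributed backend is available")


def init_distributed():
    """Initialize the process group with the chosen backend."""
    # Imported lazily so choose_backend can be used without PyTorch installed.
    import torch.distributed as dist

    available = {
        "nccl": dist.is_nccl_available(),
        "mpi": dist.is_mpi_available(),  # True only if PyTorch was built with MPI
        "gloo": dist.is_gloo_available(),
    }
    # DP_DIST_BACKEND is a hypothetical variable name, not a DeePMD-kit option.
    backend = choose_backend(os.environ.get("DP_DIST_BACKEND"), available)
    dist.init_process_group(backend=backend)
    return backend
```

With a stock PyTorch binary, is_mpi_available() typically returns False, so requesting "mpi" in this sketch fails early with a clear message instead of at process group creation.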

Manyi-Yang added the bug label Jul 5, 2024

njzjz added the awaiting response label Jul 5, 2024

njzjz added Docs and removed bug awaiting response labels Aug 21, 2024

njzjz added enhancement and removed Docs labels Aug 21, 2024
