Bug summary
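When running DPA-2 (dp_tf) training with DeePMD-kit V3.0.0a0, we always get the following warning:
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
This suggests that data loading takes up a large share of the time during training, which makes my training very slow.
Is there a way to fix this problem?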
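To make the situation the warning describes concrete, here is a toy model of a prefetch buffer. It is only a sketch under my own assumptions, not how DeePMD-kit or PyTorch actually schedule loading: the function name `count_starved_steps`, the tick-based timing, and the unbounded buffer are all hypothetical simplifications.

```python
def count_starved_steps(n_steps, load_ticks, train_ticks, workers):
    """Count training steps that had to wait because no batch was ready.

    Toy assumptions (not DeePMD-kit internals):
    - each worker finishes one batch every `load_ticks` ticks,
    - all workers start at tick 0 and the buffer is unbounded,
    - one training step takes `train_ticks` ticks.
    """
    now = 0       # current time seen by the training loop
    starved = 0   # steps that found the buffer empty
    for k in range(n_steps):
        # Batch k is produced in round k // workers of the worker pool.
        ready_at = load_ticks * (k // workers + 1)
        if ready_at > now:
            # Buffer is empty: the trainer idles until the batch arrives.
            starved += 1
            now = ready_at
        now += train_ticks
    return starved


if __name__ == "__main__":
    # Loading one batch costs 4 ticks, one training step costs 1 tick.
    for workers in (1, 2, 4):
        print(workers, count_starved_steps(100, 4, 1, workers))
```

With these toy numbers, a single worker starves the trainer on every step, two workers on every second step, and four workers only on the very first step. Whether my run behaves like this depends on how expensive loading one batch really is, which the model does not capture.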
By the way, the keyword "stat_file": "./dpa2 was used.
DeePMD-kit Version
V3.0.0a0
Backend and its version
deepmd-kit.3.0_cuda123/lib/python3.11
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: trn: rmse = 9.27e+00, rmse_e = 5.18e-02, rmse_f = 3.20e-01, lr = 1.68e-04
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: val: rmse = 2.10e+01, rmse_e = 2.09e-01, rmse_f = 7.22e-01
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:04:32,527] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: trn: rmse = 8.23e+00, rmse_e = 1.09e-01, rmse_f = 2.84e-01, lr = 1.68e-04
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: val: rmse = 8.26e+00, rmse_e = 2.24e-01, rmse_f = 2.84e-01
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:19:32,806] DEEPMD WARNING Data loading buffer is empty or nearly empty. This may indicate a data loading bottleneck, and increasing the number of workers (--num-workers) may help.
[2024-07-05 11:30:11,391] DEEPMD INFO batch 24000: trn: rmse = 1.21e+01, rmse_e = 1.30e-01, rmse_f = 4.21e-01, lr = 1.65e-04
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: val: rmse = 9.12e+00, rmse_e = 1.42e-01, rmse_f = 3.18e-01
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
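To put a number on "very slow", the log above can be reduced to seconds per batch. This is a minimal sketch: the regular expression and the helper name `seconds_per_batch` are my own, and I am assuming that "total wall time" covers the interval since the previous report, which the timestamps appear to confirm.

```python
import re
from datetime import datetime

# The "total wall time" lines copied from the log above.
LOG = """\
[2024-07-05 10:59:59,135] DEEPMD INFO batch 20000: total wall time = 867.78 s
[2024-07-05 11:15:47,722] DEEPMD INFO batch 22000: total wall time = 948.59 s
[2024-07-05 11:30:11,392] DEEPMD INFO batch 24000: total wall time = 863.67 s
"""

PATTERN = re.compile(
    r"\[(?P<ts>[^\]]+)\] DEEPMD INFO batch (?P<batch>\d+): "
    r"total wall time = (?P<wall>[\d.]+) s"
)


def seconds_per_batch(log_text):
    """Return (batch, s/batch from wall time, s/batch from timestamps)."""
    rows = []
    for m in PATTERN.finditer(log_text):
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        rows.append((int(m["batch"]), ts, float(m["wall"])))

    out = []
    # Compare each report with the previous one.
    for (b0, t0, _), (b1, t1, wall) in zip(rows, rows[1:]):
        n = b1 - b0
        out.append((b1, wall / n, (t1 - t0).total_seconds() / n))
    return out


if __name__ == "__main__":
    for batch, from_wall, from_stamps in seconds_per_batch(LOG):
        print(f"batch {batch}: {from_wall:.3f} s/batch (wall), "
              f"{from_stamps:.3f} s/batch (timestamps)")
```

For the lines shown, this works out to roughly 0.47 s per batch up to batch 22000 and roughly 0.43 s per batch up to batch 24000, and the two estimates agree with each other.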
{
  "_comment": "that's all",
  "model": {
    "type_map": ["H", "C", "N", "O"],
    "descriptor": {
      "type": "dpa2",
      "tebd_dim": 8,
      "repinit_rcut": 7.0,
      "repinit_rcut_smth": 6.0,
      "repinit_nsel": 100,
      "repformer_rcut": 4.0,
      "repformer_rcut_smth": 3.5,
      "repformer_nsel": 40,
      "repinit_neuron": [25, 50, 100],
      "repinit_axis_neuron": 12,
      "repinit_activation": "tanh",
      "repformer_nlayers": 12,
      "repformer_g1_dim": 128,
      "repformer_g2_dim": 32,
      "repformer_attn2_hidden": 32,
      "repformer_attn2_nhead": 4,
      "repformer_attn1_hidden": 128,
      "repformer_attn1_nhead": 4,
      "repformer_axis_dim": 4,
      "repformer_update_h2": false,
      "repformer_update_g1_has_conv": true,
      "repformer_update_g1_has_grrg": true,
      "repformer_update_g1_has_drrd": true,
      "repformer_update_g1_has_attn": true,
      "repformer_update_g2_has_g1g1": true,
      "repformer_update_g2_has_attn": true,
      "repformer_attn2_has_gate": true,
      "repformer_add_type_ebd_to_seq": false
    },
Steps to Reproduce
mpirun -np 4 dp --pt train --skip-neighbor-stat --mpi-log=master input.json
Further Information, Files, and Links
No response