
The training and validation loss become nan #166

Open
BlackRab opened this issue Oct 23, 2024 · 8 comments

@BlackRab

Hello, everyone! I recently used OpenSTL to train a PredRNN++ model, but after several epochs the training and validation loss became nan. Before training, I normalized the training data to the range 0–1. Why does this problem occur, and how can it be solved? Thanks!!!
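For reference, the normalization I did is just min-max scaling to [0, 1], roughly like this (an illustrative sketch, not my exact preprocessing code; the array shape is only an example):

```python
import numpy as np

def minmax_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale an array to [0, 1] using its global min and max."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + eps)

# Example: sequences shaped (N, T, C, H, W), e.g. (N, 48, 2, 64, 64)
# data = minmax_normalize(data)
```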

The following are the model parameters:

device: cuda
dist: False
res_dir: work_dirs
ex_name: custom_exp
fp16: False
torchscript: False
seed: 42
fps: False
test: False
deterministic: False
batch_size: 24
val_batch_size: 24
num_workers: 4
data_root: ./data
dataname: custom
pre_seq_length: 24
aft_seq_length: 24
total_length: 48
use_augment: False
use_prefetcher: False
drop_last: False
method: predrnnpp
config_file: None
model_type: gSTA
drop: 0.0
drop_path: 0.0
overwrite: False
epoch: 50
log_step: 1
opt: adam
opt_eps: None
opt_betas: None
momentum: 0.9
weight_decay: 0.0
clip_grad: None
clip_mode: norm
no_display_method_info: False
sched: cosine
lr: 0.0001
lr_k_decay: 1.0
warmup_lr: 1e-05
min_lr: 1e-06
final_div_factor: 10000.0
warmup_epoch: 0
decay_epoch: 100
decay_rate: 0.1
filter_bias_and_bn: False
gpus: [0]
metric_for_bestckpt: val_loss
ckpt_path: None
metrics: ['mse', 'mae']
in_shape: [24, 2, 64, 64]
num_hidden: 128,128,128,128
filter_size: 5
stride: 1
patch_size: 2
layer_norm: 0
reverse_scheduled_sampling: 0
r_sampling_step_1: 25000
r_sampling_step_2: 50000
r_exp_alpha: 5000
scheduled_sampling: 1
sampling_stop_iter: 50000
sampling_start_value: 1.0
sampling_changing_rate: 2e-05

The following is the output when I train the model:

Epoch 1: Lr: 0.0000999 | Train Loss: 0.0381869 | Vali Loss: 0.0993886
Epoch 2: Lr: 0.0000996 | Train Loss: 0.0277655 | Vali Loss: 0.0875488
Epoch 3: Lr: 0.0000991 | Train Loss: 0.0258355 | Vali Loss: 0.1026545
Epoch 4: Lr: 0.0000984 | Train Loss: 0.0249112 | Vali Loss: 0.0780292
Epoch 5: Lr: 0.0000976 | Train Loss: 0.0240881 | Vali Loss: 0.0948234
Epoch 6: Lr: 0.0000965 | Train Loss: 0.0235908 | Vali Loss: 0.1095217
Epoch 7: Lr: 0.0000953 | Train Loss: 0.0232413 | Vali Loss: 0.1180097
Epoch 8: Lr: 0.0000939 | Train Loss: 0.0230120 | Vali Loss: 0.0807080
Epoch 9: Lr: 0.0000923 | Train Loss: 0.0228280 | Vali Loss: 0.0957872
Epoch 10: Lr: 0.0000905 | Train Loss: 0.0226669 | Vali Loss: 0.0887136
Epoch 11: Lr: 0.0000886 | Train Loss: 0.0225349 | Vali Loss: 0.0886962
Epoch 12: Lr: 0.0000866 | Train Loss: 0.0244125 | Vali Loss: 0.0808648
Epoch 13: Lr: 0.0000844 | Train Loss: 0.0230432 | Vali Loss: 0.1242904
Epoch 14: Lr: 0.0000821 | Train Loss: 0.0226797 | Vali Loss: 0.1319531
Epoch 15: Lr: 0.0000796 | Train Loss: 0.0225613 | Vali Loss: 0.0810021
Epoch 16: Lr: 0.0000770 | Train Loss: 0.0224811 | Vali Loss: 0.0879602
Epoch 17: Lr: 0.0000743 | Train Loss: 0.0224050 | Vali Loss: 0.1517410
Epoch 18: Lr: 0.0000716 | Train Loss: 0.0223686 | Vali Loss: 0.0887038
Epoch 19: Lr: 0.0000687 | Train Loss: 0.0223244 | Vali Loss: 0.0862220
Epoch 20: Lr: 0.0000658 | Train Loss: 0.0222688 | Vali Loss: 0.0773778
Epoch 21: Lr: 0.0000628 | Train Loss: 0.0222274 | Vali Loss: 0.0896754
Epoch 22: Lr: 0.0000598 | Train Loss: 0.0221916 | Vali Loss: 0.0806330
Epoch 23: Lr: 0.0000567 | Train Loss: 0.0221258 | Vali Loss: 0.0930094
Epoch 24: Lr: 0.0000536 | Train Loss: 0.0221751 | Vali Loss: 0.0749368
Epoch 25: Lr: 0.0000505 | Train Loss: 0.0220574 | Vali Loss: 0.0836148
Epoch 26: Lr: 0.0000474 | Train Loss: 0.0219867 | Vali Loss: nan
Epoch 27: Lr: 0.0000443 | Train Loss: nan | Vali Loss: nan
Epoch 28: Lr: 0.0000412 | Train Loss: nan | Vali Loss: nan
Epoch 29: Lr: 0.0000382 | Train Loss: nan | Vali Loss: nan
@mathDR commented Oct 23, 2024

What are your gradient values for the few steps before the nans occur?

@BlackRab (Author)

Can you tell me how to output the gradient values when using OpenSTL? I don't know how to do that. Thanks!

> What are your gradient values for the few steps before the nans occur?

@mathDR commented Oct 23, 2024

> Can you tell me how to output the gradient values when using OpenSTL? I don't know how to do that. Thanks!
>
> > What are your gradient values for the few steps before the nans occur?

How are you writing the above output? Is that the default? (Apologies, I do not have the code open in front of me.)

I would search for where these print statements happen in the code and amend them to also print the norm of the gradient. My conjecture (given that your loss isn't really decreasing anymore) is that you are in a region of flat geometry and the gradient is running into trouble.
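Something like this would do it: a minimal sketch of computing the global gradient norm in a PyTorch training step (OpenSTL is built on PyTorch; you'd have to locate where its training loop actually lives, and `grad_global_norm` / `model` here are placeholder names):

```python
import torch

def grad_global_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients; call after loss.backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# In the training step, after loss.backward() and before optimizer.step():
#   print(f"grad norm: {grad_global_norm(model):.6f}")
```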

@BlackRab (Author)

May I ask what should be done to solve this problem? For example, tweaking the training data, reducing the number of training samples, or something like that?

@mathDR commented Oct 23, 2024

> May I ask what should be done to solve this problem? For example, tweaking the training data, reducing the number of training samples, or something like that?

You can either change your learning-rate schedule to decrease the lr more rapidly, or accept that, given your loss, the model has effectively "converged" at epoch 25. Do you have any prior belief that the loss should be less than 0.0220574?

But your learning rate at that point is around 5e-5, which might be too big to get better estimates, so I would definitely use a more aggressive scheduler.
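Concretely, your config dump already exposes the relevant knobs. A rough sketch of the kind of settings I mean, using the parameter names from that dump (the exact values, and whether you pass them as CLI flags or in a config file, depend on your setup):

```python
# Illustrative overrides only; names come from the config dump above.
overrides = dict(
    lr=5e-5,           # start lower than the current 1e-4
    min_lr=1e-6,       # keep the cosine floor
    warmup_epoch=5,    # a short warmup can also improve stability
    clip_grad=1.0,     # enable gradient clipping (clip_mode is already 'norm')
)
```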

@BlackRab (Author)

Thanks! I'll set a smaller lr to train the model.

Although nan appeared during training, I made predictions with the best saved checkpoint, but the results were not very good. I had previously trained another PredRNN++ model; the only difference is that it used less training data. The newly trained model is not better than the previous one, so I think it has not yet reached its best performance.

@mathDR commented Oct 23, 2024

> Thanks! I'll set a smaller lr to train the model.
>
> Although nan appeared during training, I made predictions with the best saved checkpoint, but the results were not very good. I had previously trained another PredRNN++ model; the only difference is that it used less training data. The newly trained model is not better than the previous one, so I think it has not yet reached its best performance.

"The results were not very good." what does this mean? Is there something that you are doing to discern this that isn't being represented in your model? Like I asked above: what makes you think the loss will go less than 0.022?

@BlackRab (Author)

"The results were not very good." what does this mean? Is there something that you are doing to discern this that isn't being represented in your model? Like I asked above: what makes you think the loss will go less than 0.022?

Honestly, I have no strict basis for thinking that the loss should be lower than 0.022.

My previous model, trained on a smaller dataset, produced better predictions than this current model that 'converged' at epoch 25. I would expect the model trained on more data to perform better, so I don't think the current model is optimal. But it is true that the optimal model will not necessarily have a loss below 0.022.
