Error "Loss is nan, stopping training" #1

Open
xiaopanchen opened this issue Oct 17, 2024 · 0 comments

When I run the following command, training stops with the error "Loss is nan, stopping training":

CUDA_VISIBLE_DEVICES=0 python train.py --cfg configs/cuhk_sysu.yaml INPUT.BATCH_SIZE_TRAIN 2 SOLVER.BASE_LR 0.0012 SOLVER.MAX_EPOCHS 20 SOLVER.LR_DECAY_MILESTONES [11] MODEL.LOSS.USE_SOFTMAX True SOLVER.LW_RCNN_SOFTMAX_2ND 0.1 SOLVER.LW_RCNN_SOFTMAX_3RD 0.1 OUTPUT_DIR ./logs/cuhk-sysu

How can I solve this problem? Thanks.

Start training...
Epoch: [0] [ 0/5603] eta: 11:33:02 lr: 0.000001 loss: 15.3648 (15.3648) loss_rcnn_cls_1st: 0.7488 (0.7488) loss_rcnn_reg_1st: 1.0251 (1.0251) loss_rcnn_cls_2nd: 0.8071 (0.8071) loss_rcnn_reg_2nd: 0.1901 (0.1901) loss_rcnn_cls_3rd: 0.8165 (0.8165) loss_rcnn_reg_3rd: 0.0001 (0.0001) loss_rcnn_reid_2nd: 4.6311 (4.6311) loss_rcnn_reid_3rd: 4.6311 (4.6311) loss_rpn_reg: 0.1002 (0.1002) loss_rpn_cls: 0.6908 (0.6908) loss_box_softmax_2nd: 0.8609 (0.8609) loss_box_softmax_3rd: 0.8630 (0.8630) time: 7.4214 data: 6.1422 max mem: 12317
Loss is nan, stopping training
{'loss_rcnn_cls_1st': tensor(0.6996, device='cuda:0', grad_fn=), 'loss_rcnn_reg_1st': tensor(0.9864, device='cuda:0', grad_fn=), 'loss_rcnn_cls_2nd': tensor(0.8071, device='cuda:0', grad_fn=), 'loss_rcnn_reg_2nd': tensor(0.1589, device='cuda:0', grad_fn=), 'loss_rcnn_cls_3rd': tensor(0.8237, device='cuda:0', grad_fn=), 'loss_rcnn_reg_3rd': tensor(3.1079e-05, device='cuda:0', grad_fn=), 'loss_rcnn_reid_2nd': tensor(nan, device='cuda:0', grad_fn=), 'loss_rcnn_reid_3rd': tensor(nan, device='cuda:0', grad_fn=), 'loss_rpn_reg': tensor(0.0262, device='cuda:0', grad_fn=), 'loss_rpn_cls': tensor(0.6909, device='cuda:0', grad_fn=), 'loss_box_softmax_2nd': tensor(nan, device='cuda:0', grad_fn=), 'loss_box_softmax_3rd': tensor(nan, device='cuda:0', grad_fn=)}

Process finished with exit code 1
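
In the dictionary printed above, only some of the terms are nan and the rest are finite. To make that easier to see, here is a minimal sketch that lists the non-finite entries of a loss dictionary. The helper name `non_finite_losses` is hypothetical and not part of this repository. The values are copied by hand from the log, and the sketch assumes each entry can be converted with `float()`, which holds for plain numbers and for 0-dim PyTorch tensors.

```python
import math


def non_finite_losses(loss_dict):
    """Return the sorted names of loss terms whose value is NaN or infinite.

    Hypothetical helper for debugging; not part of the repository.
    """
    # float() accepts plain numbers and also 0-dim torch tensors.
    return sorted(
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    )


# Values copied by hand from the dictionary printed in the log.
losses = {
    "loss_rcnn_cls_1st": 0.6996,
    "loss_rcnn_reg_1st": 0.9864,
    "loss_rcnn_cls_2nd": 0.8071,
    "loss_rcnn_reg_2nd": 0.1589,
    "loss_rcnn_cls_3rd": 0.8237,
    "loss_rcnn_reg_3rd": 3.1079e-05,
    "loss_rcnn_reid_2nd": float("nan"),
    "loss_rcnn_reid_3rd": float("nan"),
    "loss_rpn_reg": 0.0262,
    "loss_rpn_cls": 0.6909,
    "loss_box_softmax_2nd": float("nan"),
    "loss_box_softmax_3rd": float("nan"),
}

bad = non_finite_losses(losses)
print(bad)
```

For this log it reports the two `loss_rcnn_reid_*` terms and the two `loss_box_softmax_*` terms, so the detection and RPN losses look fine and the problem seems to be confined to the re-id and softmax branches. Calling `torch.autograd.set_detect_anomaly(True)` before training may also help locate the operation that first produces the nan, at the cost of slower iterations.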
